Abstract



Relative to the large literature on upper bounds on the complexity of convex optimization, less
attention has been paid to the fundamental hardness of these problems. Given the extensive
use of convex optimization in machine learning and statistics, gaining an understanding of these
complexity-theoretic issues is important. In this paper, we study the complexity of stochastic
convex optimization in an oracle model of computation. We improve upon known results and
obtain tight minimax complexity estimates for various function classes.

1 Introduction

Convex optimization forms the backbone of many algorithms for statistical learning and estima- 
tion. Given that many statistical estimation problems are large-scale in nature — with the problem 
dimension and/or sample size being large — it is essential to make efficient use of computational 
resources. Stochastic optimization algorithms are an attractive class of methods, known to yield 
moderately accurate solutions in a relatively short time [1]. Given the popularity of such stochastic
optimization methods, understanding the fundamental computational complexity of stochastic
convex optimization is thus a key issue for large-scale learning. A large body of literature is devoted 
to obtaining rates of convergence of specific procedures for various classes of convex optimization 
problems. A typical outcome of such analysis is an upper bound on the error — for instance, gap to 
the optimal cost — as a function of the number of iterations. Such analyses have been performed 
for many standard optimization algorithms, among them gradient descent, mirror descent, interior 
point programming, and stochastic gradient descent, to name a few. We refer the reader to various 
standard texts on optimization (e.g., [2, 3, 4]) for further details on such results.

On the other hand, there has been relatively little study of the inherent complexity of con- 
vex optimization problems. To the best of our knowledge, the first formal study in this area was 
undertaken in the seminal work of Nemirovski and Yudin [5], hereafter referred to as NY. One 
obstacle to a classical complexity-theoretic analysis, as these authors observed, is that of casting 
convex optimization problems in a Turing Machine model. They avoided this problem by instead 
considering a natural oracle model of complexity, in which at every round the optimization
procedure queries an oracle for certain information on the function being optimized. This information
can be either noiseless or noisy, depending on whether the goal is to lower bound the oracle com- 
plexity of deterministic or stochastic optimization algorithms. Working within this framework, the 
authors obtained a series of lower bounds on the computational complexity of convex optimization 
problems, both in deterministic and stochastic settings. In addition to the original text NY [5], we 
refer the interested reader to the book by Nesterov [4], and the lecture notes by Nemirovski [6] for
further background. 

In this paper, we consider the computational complexity of stochastic convex optimization 
within this oracle model. In particular, we improve upon the work of NY [5] for stochastic convex 
optimization in three ways. First, our lower bounds have an improved dependence on the dimension
of the space. In the context of statistical estimation, these bounds show how the difficulty of the 
estimation problem increases with the number of parameters. Second, our techniques naturally 
extend to give sharper results for optimization over simpler function classes. We show that the 
complexity of optimization for strongly convex losses is smaller than that for convex, Lipschitz 
losses. Third, we show that for a fixed function class, if the set of optimizers is assumed to have 
special structure such as sparsity, then the fundamental complexity of optimization can be signif- 
icantly smaller. All of our proofs exploit a new notion of the discrepancy between two functions 
that appears to be natural for optimization problems. They involve a reduction from a statisti- 
cal parameter estimation problem to the stochastic optimization problem, and an application of 
information-theoretic lower bounds for the estimation problem. We note that special cases of the 
first two results in this paper appeared in the extended abstract [7], and that a related study was
independently undertaken by Raginsky and Rakhlin [8]. 

The remainder of this paper is organized as follows. We begin in Section 2 with background on
oracle complexity, and a precise formulation of the problems addressed in this paper. Section 3 is
devoted to the statement of our main results, and discussion of their consequences. In Section 4,
we provide the proofs of our main results, which all exploit a common framework of four steps. 
More technical aspects of these proofs are deferred to the appendices. 



Notation: For the convenience of the reader, we collect here some notation used throughout the
paper. For $p \in [1, \infty]$, we use $\|x\|_p$ to denote the $\ell_p$-norm of a vector $x \in \mathbb{R}^d$, and we let $q$ denote
the conjugate exponent, satisfying $\frac{1}{p} + \frac{1}{q} = 1$. For two distributions $P$ and $Q$, we use $D(P \,\|\, Q)$ to
denote the Kullback-Leibler (KL) divergence between the distributions. The notation $\mathbb{I}(A)$ refers
to the 0-1 valued indicator random variable of the set $A$. For two vectors $\alpha, \beta \in \{-1,+1\}^d$, we
define the Hamming distance $\Delta_H(\alpha, \beta) := \sum_{i=1}^d \mathbb{I}[\alpha_i \neq \beta_i]$. Given a convex function $f : \mathbb{R}^d \to \mathbb{R}$,
the subdifferential of $f$ at $x$ is the set $\partial f(x) := \{z \in \mathbb{R}^d \mid f(y) \ge f(x) + \langle z, y - x \rangle \text{ for all } y \in \mathbb{R}^d\}$.
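
As a concrete reading of this notation, the following minimal Python sketch computes the conjugate exponent, the $\ell_p$-norm, and the Hamming distance. The function names are illustrative choices and do not appear in the paper.

```python
import math

def conjugate_exponent(p):
    """Return q with 1/p + 1/q = 1, using math.inf for the endpoint cases."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def lp_norm(x, p):
    """l_p norm of a sequence of numbers, for p in [1, inf]."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def hamming_distance(alpha, beta):
    """Number of coordinates in which two sign vectors differ."""
    return sum(1 for a, b in zip(alpha, beta) if a != b)
```

For instance, `conjugate_exponent(2)` returns 2.0, reflecting that the $\ell_2$-norm is its own dual.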

2 Background and problem formulation 

We begin by introducing background on the oracle model of convex optimization, and then turn to 
a precise specification of the problem to be studied. 
2.1 Convex optimization in the oracle model

Convex optimization is the task of minimizing a convex function $f$ over a convex set $\mathbb{S} \subseteq \mathbb{R}^d$.
Assuming that the minimum is achieved, it corresponds to computing an element $x^*_f$ that achieves the
minimum, that is, an element $x^*_f \in \arg\min_{x \in \mathbb{S}} f(x)$. An optimization method is any procedure that
solves this task, typically by repeatedly selecting values from $\mathbb{S}$. For a given class of optimization
problems, our primary focus in this paper is to determine lower bounds on the computational cost,
as measured in terms of the number of (noisy) function and subgradient evaluations, required to
obtain an $\epsilon$-optimal solution to any optimization problem within the class.

More specifically, we follow the approach of Nemirovski and Yudin [5], and measure computational
cost based on the oracle model of optimization. The main components of this model are an
oracle and an information set. An oracle is a (possibly random) function $\phi : \mathbb{S} \to \mathcal{I}$ that answers
any query $x \in \mathbb{S}$ by returning an element $\phi(x)$ in an information set $\mathcal{I}$. The information set varies
depending on the oracle; for instance, for an exact oracle of $m$th order, the answer to a query $x_t$
consists of $f(x_t)$ and the first $m$ derivatives of $f$ at $x_t$. For the case of stochastic oracles studied
in this paper, these values are corrupted with zero-mean noise with bounded variance. We then
measure the computational labor of any optimization method as the number of queries it poses to
the oracle.

In particular, given a positive integer $T$ corresponding to the number of iterations, an optimization
method $\mathcal{M}$ designed to approximately minimize the convex function $f$ over the convex set $\mathbb{S}$
proceeds as follows. At any given iteration $t = 1, \ldots, T$, the method $\mathcal{M}$ queries at $x_t \in \mathbb{S}$, and the
oracle reveals the information $\phi(x_t, f)$. The method then uses the information $\{\phi(x_1, f), \ldots, \phi(x_t, f)\}$
to decide at which point $x_{t+1}$ the next query should be made. For a given oracle function $\phi$, let $\mathbb{M}_T$
denote the class of all optimization methods $\mathcal{M}$ that make $T$ queries according to the procedure
outlined above. For any method $\mathcal{M} \in \mathbb{M}_T$, we define its error on function $f$ after $T$ steps as

$\epsilon_T(\mathcal{M}, f, \mathbb{S}, \phi) := f(x_T) - \min_{x \in \mathbb{S}} f(x) = f(x_T) - f(x^*_f)$, (1)

where $x_T$ is the method's query at time $T$. Note that by definition of $x^*_f$ as a minimizing argument,
this error is a non-negative quantity.

When the oracle is stochastic, the method's query $x_T$ at time $T$ is itself random, since it depends
on the random answers provided by the oracle. In this case, the optimization error $\epsilon_T(\mathcal{M}, f, \mathbb{S}, \phi)$
is also a random variable. Accordingly, for the case of stochastic oracles, we measure the accuracy
in terms of the expected value $\mathbb{E}_\phi[\epsilon_T(\mathcal{M}, f, \mathbb{S}, \phi)]$, where the expectation is taken over the oracle
randomness. Given a class of functions $\mathcal{F}$ defined over a convex set $\mathbb{S}$ and a class $\mathbb{M}_T$ of all
optimization methods based on $T$ oracle queries, we define the minimax error

$\epsilon^*_T(\mathcal{F}, \mathbb{S}; \phi) := \inf_{\mathcal{M} \in \mathbb{M}_T} \sup_{f \in \mathcal{F}} \mathbb{E}_\phi[\epsilon_T(\mathcal{M}, f, \mathbb{S}, \phi)]$. (2)

In the sequel, we provide results for particular classes of oracles. So as to ease the notation, when
the oracle $\phi$ is clear from the context, we simply write $\epsilon^*_T(\mathcal{F}, \mathbb{S})$.

2.2 Stochastic first-order oracles

In this paper, we study stochastic oracles for which the information set $\mathcal{I} \subset \mathbb{R} \times \mathbb{R}^d$ consists of pairs
of noisy function and subgradient evaluations. More precisely, we have:
Definition 1. For a given set $\mathbb{S}$ and function class $\mathcal{F}$, the class of first-order stochastic oracles
consists of random mappings $\phi : \mathbb{S} \times \mathcal{F} \to \mathcal{I}$ of the form $\phi(x, f) = (\hat{f}(x), \hat{z}(x))$ such that

$\mathbb{E}[\hat{f}(x)] = f(x)$, $\mathbb{E}[\hat{z}(x)] \in \partial f(x)$, and $\mathbb{E}[\|\hat{z}(x)\|_p^2] \le \sigma^2$. (3)

We use $\mathbb{O}_{p,\sigma}$ to denote the class of all stochastic first-order oracles with parameters $(p, \sigma)$. Note that
the first two conditions imply that $\hat{f}(x)$ is an unbiased estimate of the function value $f(x)$, and that
$\hat{z}(x)$ is an unbiased estimate of a subgradient $z \in \partial f(x)$. When $f$ is actually differentiable, then
$\hat{z}(x)$ is an unbiased estimate of the gradient $\nabla f(x)$. The third condition in equation (3) controls
the "noisiness" of the subgradient estimates in terms of the $\ell_p$-norm.

Stochastic gradient methods are a widely used class of algorithms that can be understood as 
operating based on information provided by a stochastic first-order oracle. As a particular example, 
consider a function of the separable form $f(x) = \frac{1}{n}\sum_{i=1}^n h_i(x)$, where each $h_i$ is differentiable.
Functions of this form arise very frequently in statistical problems, where each term $i$ corresponds
to a different sample and the overall cost function is some type of statistical loss (e.g., maximum
likelihood, support vector machines, boosting, etc.). The natural stochastic gradient method for this
problem is to choose an index $i \in \{1, 2, \ldots, n\}$ uniformly at random, and then to return the pair
$(h_i(x), \nabla h_i(x))$. Taking averages over the randomly chosen index $i$ yields $\frac{1}{n}\sum_{i=1}^n h_i(x) = f(x)$, so
that $h_i(x)$ is an unbiased estimate of $f(x)$, with an analogous unbiased property holding for the
gradient of $h_i(x)$.
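
The separable example and the query protocol of Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional losses $h_i(x) = \frac{1}{2}(x - a_i)^2$, the unconstrained domain, the step size `step / t`, and all function names are hypothetical choices, not taken from the paper.

```python
import random

def make_finite_sum_oracle(samples, rng):
    """Stochastic first-order oracle for f(x) = (1/n) * sum_i h_i(x),
    with the assumed losses h_i(x) = 0.5 * (x - a_i)**2 in one dimension."""
    def oracle(x):
        a = rng.choice(samples)        # index i chosen uniformly at random
        value = 0.5 * (x - a) ** 2     # h_i(x), a noisy estimate of f(x)
        grad = x - a                   # gradient of h_i at x, a noisy estimate of f'(x)
        return value, grad
    return oracle

def run_method(oracle, x0, num_queries, step):
    """A method in the oracle model: each new query depends only on past answers."""
    x = x0
    for t in range(1, num_queries + 1):
        _, grad = oracle(x)            # query the oracle at the current point
        x = x - (step / t) * grad      # decide where to query next
    return x

samples = [1.0, 2.0, 3.0, 6.0]
x = 0.5
# Averaging the oracle answers over all n indices recovers f(x) and f'(x) exactly.
mean_value = sum(0.5 * (x - a) ** 2 for a in samples) / len(samples)
mean_grad = sum(x - a for a in samples) / len(samples)
x_final = run_method(make_finite_sum_oracle(samples, random.Random(0)), 0.0, 2000, 1.0)
```

With these choices the iterate is a running average of the sampled $a_i$, so `x_final` lands close to the minimizer of $f$, which is the mean of `samples`.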

2.3 Function classes of interest 

We now turn to the classes $\mathcal{F}$ of convex functions for which we study oracle complexity. In all
cases, we consider real-valued convex functions defined over some convex set $\mathbb{S}$. We assume without
loss of generality that $\mathbb{S}$ contains an open set around 0, and many of our lower bounds involve the
maximum radius $r = r(\mathbb{S}) > 0$ such that

$\mathbb{S} \supseteq \mathbb{B}_\infty(r) := \{x \in \mathbb{R}^d \mid \|x\|_\infty \le r\}$. (4)

Our first class consists of convex Lipschitz functions: 

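Definition 2. For a given convex set $\mathbb{S} \subseteq \mathbb{R}^d$ and parameter $p \in [1, \infty]$, the class $\mathcal{F}_{cv}(\mathbb{S}, L, p)$
consists of all convex functions $f : \mathbb{S} \to \mathbb{R}$ such that

$|f(x) - f(y)| \le L \|x - y\|_q$ for all $x, y \in \mathbb{S}$, (5)

where $\frac{1}{q} = 1 - \frac{1}{p}$.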

We have defined the Lipschitz condition (5) in terms of the conjugate exponent $q \in [1, \infty]$,
defined by the relation $\frac{1}{q} = 1 - \frac{1}{p}$. To be clear, our motivation in doing so is to maintain consistency
with our definition of the stochastic first-order oracle, in which we assumed that $\mathbb{E}[\|\hat{z}(x)\|_p^2] \le \sigma^2$.
We note that the Lipschitz condition (5) is equivalent to the condition

$\|z\|_p \le L$ for all $z \in \partial f(x)$ and for all $x \in \mathrm{int}(\mathbb{S})$.
If we consider the case of a differentiable function $f$, the unbiasedness condition in Definition 1
implies that

$\|\nabla f(x)\|_p = \|\mathbb{E}[\hat{z}(x)]\|_p \overset{(a)}{\le} \mathbb{E}\|\hat{z}(x)\|_p \overset{(b)}{\le} \sqrt{\mathbb{E}\|\hat{z}(x)\|_p^2} \le \sigma$,

where inequality (a) follows from the convexity of the $\ell_p$-norm and Jensen's inequality, and inequality
(b) is a result of Jensen's inequality applied to the concave function $\sqrt{x}$. This bound implies
that $f$ must be Lipschitz with constant at most $\sigma$ with respect to the dual $\ell_q$-norm. Therefore, we
necessarily must have $L \le \sigma$, in order for the function class from Definition 2 to be consistent with
the stochastic first-order oracle.



A second function class consists of strongly convex functions, defined as follows: 

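Definition 3. For a given convex set $\mathbb{S} \subseteq \mathbb{R}^d$ and parameter $p \in [1, \infty]$, the class $\mathcal{F}_{scv}(\mathbb{S}, p; L, \gamma)$
consists of all convex functions $f : \mathbb{S} \to \mathbb{R}$ such that the Lipschitz condition (5) holds, and such
that $f$ satisfies the $\ell_2$-strong convexity condition

$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \alpha (1 - \alpha) \frac{\gamma^2}{2} \|x - y\|_2^2$ for all $x, y \in \mathbb{S}$ and $\alpha \in [0, 1]$. (6)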

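In this paper, we restrict our attention to the case of strong convexity with respect to the
$\ell_2$-norm. (Similar results on the oracle complexity for strong convexity with respect to different
norms can be obtained by straightforward modifications of the arguments given here.) For future
reference, it should be noted that the Lipschitz constant $L$ and strong convexity constant $\gamma$ interact
with one another. In particular, whenever $\mathbb{S} \subseteq \mathbb{R}^d$ contains the $\ell_\infty$-ball of radius $r$, the Lipschitz
and strong convexity constants $L$ and $\gamma$ must satisfy the inequality

$\frac{\gamma^2}{4} \le \frac{L\, d^{-1/p}}{r}$. (7)

In order to establish this inequality, we note that the strong convexity condition with $\alpha = 1/2$ implies
that

$\frac{\gamma^2}{8} \le \frac{f(x) + f(y) - 2 f\left(\frac{x + y}{2}\right)}{2 \|x - y\|_2^2} \le \frac{L \|x - y\|_q}{2 \|x - y\|_2^2}$,

where the second inequality applies the Lipschitz condition (5) to the pairs $(x, \frac{x+y}{2})$ and $(y, \frac{x+y}{2})$.
We now choose the pair $x, y \in \mathbb{S}$ such that $\|x - y\|_\infty = r$ and $\|x - y\|_2 = r\sqrt{d}$. Such a choice is
possible whenever $\mathbb{S}$ contains the $\ell_\infty$ ball of radius $r$. Since we have $\|x - y\|_q \le d^{1/q} \|x - y\|_\infty$, this
choice yields $\frac{\gamma^2}{4} \le \frac{L\, d^{\frac{1}{q} - 1}}{r}$, which establishes the claim (7) because $\frac{1}{q} - 1 = -\frac{1}{p}$.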



As a third example, we study the oracle complexity of optimization over the class of convex 
functions that have sparse minimizers. This class of functions is well-motivated, since a large body 
of statistical work has studied the estimation of vectors, matrices and functions under various types 
of sparsity constraints. A common theme in this line of work is that the ambient dimension d enters 
only logarithmically, and so has a mild effect. Consequently, it is natural to investigate whether 
the complexity of optimization methods also enjoys such a mild dependence on ambient dimension 
under sparsity assumptions. 

For a vector $x \in \mathbb{R}^d$, we use $\|x\|_0$ to denote the number of non-zero elements in $x$. Recalling
the set $\mathcal{F}_{cv}(\mathbb{S}, L, p)$ from Definition 2, we now define a class of Lipschitz functions with sparse
minimizers.

Definition 4. For a convex set $\mathbb{S} \subseteq \mathbb{R}^d$ and positive integer $k \le \lfloor d/2 \rfloor$, let

$\mathcal{F}_{sp}(k; \mathbb{S}, L) := \{f \in \mathcal{F}_{cv}(\mathbb{S}, L, \infty) \mid \exists\, x^* \in \arg\min_{x \in \mathbb{S}} f(x) \text{ satisfying } \|x^*\|_0 \le k\}$ (8)

be the class of all convex functions that are $L$-Lipschitz in the $\ell_\infty$-norm, and have at least one
$k$-sparse optimizer.

We frequently use the shorthand notation $\mathcal{F}_{sp}(k)$ when the set $\mathbb{S}$ and parameter $L$ are clear from
context.



3 Main results and their consequences 

With the setup of stochastic convex optimization in place, we are now in a position to state the 
main results of this paper, and to discuss some of their consequences. As previously mentioned, a 
subset of our results assume that the set $\mathbb{S}$ contains an $\ell_\infty$ ball of radius $r = r(\mathbb{S})$. Our bounds
scale with $r$, thereby reflecting the natural dependence on the size of the set $\mathbb{S}$. Also, we set the
oracle second moment bound $\sigma$ to be the same as the Lipschitz constant $L$ in our results.



3.1 Oracle complexity for convex Lipschitz functions 

We begin by analyzing the minimax oracle complexity of optimization for the class of bounded and
convex Lipschitz functions $\mathcal{F}_{cv}$ from Definition 2.

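Theorem 1. Let $\mathbb{S} \subseteq \mathbb{R}^d$ be a convex set such that $\mathbb{S} \supseteq \mathbb{B}_\infty(r)$ for some $r > 0$. Then for a universal
constant $c_0 > 0$, the minimax oracle complexity over the class $\mathcal{F}_{cv}(\mathbb{S}, L, p)$ satisfies the following
lower bounds:

(a) For $1 \le p \le 2$,

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{cv}, \mathbb{S}; \phi) \ge \min\left\{ c_0 L r \sqrt{\frac{d}{T}}, \; \frac{L r}{144} \right\}$. (9)

(b) For $p > 2$,

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{cv}, \mathbb{S}; \phi) \ge \min\left\{ c_0 L r \frac{d^{1 - 1/p}}{\sqrt{T}}, \; \frac{L d^{1 - 1/p} r}{72} \right\}$. (10)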

Remarks: Nemirovski and Yudin [5] proved the lower bound $\Omega(1/\sqrt{T})$ for the function class $\mathcal{F}_{cv}$,
in the special case that $\mathbb{S}$ is the unit ball of a given norm, and the functions are Lipschitz in the
corresponding dual norm. For $p \ge 2$, they established the minimax optimality of this
dimension-independent result by appealing to a matching upper bound achieved by the method of mirror
descent. In contrast, here we do not require the two norms (namely, that constraining the set $\mathbb{S}$
and that for the Lipschitz constraint) to be dual to one another; instead, we give lower bounds in
terms of the largest $\ell_\infty$ ball contained within the constraint set $\mathbb{S}$. As discussed below, our bounds
do include the results for the dual setting of past work as a special case, but more generally, by
examining the relative geometry of an arbitrary set with respect to the $\ell_\infty$ ball, we obtain results for
arbitrary sets. (We note that the $\ell_\infty$ constraint is natural in many optimization problems arising
in machine learning settings, in which upper and lower bounds on variables are often imposed.)
Thus, in contrast to the past work of NY on stochastic optimization, our analysis gives sharper
dimension dependence under more general settings. It also highlights the role of the geometry of
the set $\mathbb{S}$ in determining the oracle complexity.

In general, our lower bounds cannot be improved, and hence specify the optimal minimax oracle 
complexity. We consider here some examples to illustrate their sharpness. Throughout we assume 
that $T$ is large enough to ensure that the $1/\sqrt{T}$ term attains the lower bound and not the $L/144$
term. (This condition is reasonable given our goal of understanding the rate as T increases, as 
opposed to the transient behavior over the first few iterations.) 

(a) We start from the special case that has been primarily considered in past works. We consider
the class $\mathcal{F}_{cv}(\mathbb{B}_q(1), L, p)$ with $1/q = 1 - 1/p$ and the stochastic first-order oracles $\mathbb{O}_{p,L}$ for this
class. Then the radius $r$ of the largest $\ell_\infty$ ball inscribed within $\mathbb{B}_q(1)$ scales as $r = d^{-1/q}$.
By inspection of the lower bounds (9) and (10), we see that

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{cv}, \mathbb{B}_q(1); \phi) = \begin{cases} \Omega\left(L \frac{d^{1/2 - 1/q}}{\sqrt{T}}\right) & \text{for } 1 \le p \le 2 \\ \Omega\left(\frac{L}{\sqrt{T}}\right) & \text{for } p > 2. \end{cases}$ (11)

As mentioned previously, the dimension-independent lower bound for the case $p \ge 2$ was
demonstrated in Chapter 5 of NY, and shown to be optimal¹ since it is achieved using mirror
descent with the prox-function $\|\cdot\|_q$. For the case of $1 \le p < 2$, the lower bounds are also
unimprovable, since they are again achieved (up to constant factors) by stochastic gradient
descent. See Appendix C for further details on these matching upper bounds.

(b) Let us now consider how our bounds can also make sharp predictions for non-dual geometries,
using the special case $\mathbb{S} = \mathbb{B}_\infty(1)$. For this choice, we have $r(\mathbb{S}) = 1$, and hence Theorem 1
implies that for all $p \in [1, 2]$, the minimax oracle complexity is lower bounded as

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{cv}, \mathbb{B}_\infty(1); \phi) = \Omega\left(L \sqrt{\frac{d}{T}}\right)$.

This lower bound is sharp for all $p \in [1, 2]$. Indeed, for any convex set $\mathbb{S}$, stochastic gradient
descent achieves a matching upper bound (see Section 5.2.4, p. 196 of NY [5], as well as
Appendix C in this paper for further discussion).

(c) As another example, suppose that $\mathbb{S} = \mathbb{B}_2(1)$. Observe that this $\ell_2$-norm unit ball satisfies the
relation $\mathbb{B}_2(1) \supseteq \frac{1}{\sqrt{d}} \mathbb{B}_\infty(1)$, so that we have $r(\mathbb{B}_2(1)) = 1/\sqrt{d}$. Consequently, for this choice,
the lower bound (9) takes the form

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{cv}, \mathbb{B}_2(1); \phi) = \Omega\left(\frac{L}{\sqrt{T}}\right)$,

which is a dimension-independent lower bound. This lower bound for $\mathbb{B}_2(1)$ is indeed tight
for $p \in [1, 2]$, and as before, this rate is achieved by stochastic gradient descent [5].

¹ There is an additional logarithmic factor in the upper bounds for $p = \Omega(\log d)$.
(d) Turning to the case of $p > 2$, when $\mathbb{S} = \mathbb{B}_\infty(1)$, the lower bound (10) can be achieved (up
to constant factors) using mirror descent with the dual norm $\|\cdot\|_q$; for further discussion,
we again refer the reader to Section 5.2.1, p. 190 of NY [5], as well as to Appendix C of this
paper. Also, even though this lower bound requires the oracle to have only bounded variance,
our proof actually uses a stochastic oracle based on Bernoulli random variables, for which all
moments exist. Consequently, at least in general, our results show that there is no hope of
achieving faster rates by restricting to oracles with bounds on higher-order moments. This is
an interesting contrast to the case of having less than two moments, in which the rates are
slower. For instance, as shown in Section 5.3.1 of NY [5], suppose that the gradient estimates
in a stochastic oracle satisfy the moment bound $\mathbb{E}\|\hat{z}(x)\|_p^b \le \sigma^2$ for some $b \in [1, 2)$. In this
setting, the oracle complexity is lower bounded by $\Omega(T^{-(b-1)/b})$. Since $T^{-1/2} \ll T^{-(b-1)/b}$ for all
$b \in [1, 2)$, there is a significant penalty in convergence rates for having less than two bounded
moments.

(e) Even though the results have been stated in a first-order stochastic oracle model, they actually
hold in a stronger sense. Let $\nabla^i f(x)$ denote the $i$th-order derivative of $f$ evaluated at $x$, when
it exists. With this notation, our results apply to an oracle that responds with a random
function $\hat{f}_t$ such that

$\mathbb{E}[\hat{f}_t(x)] = f(x)$, and $\mathbb{E}[\nabla^i \hat{f}_t(x)] = \nabla^i f(x)$ for all $x \in \mathbb{S}$ and $i$ such that $\nabla^i f(x)$ exists,

along with appropriately bounded second moments of all the derivatives. Consequently,
higher-order gradient information cannot improve convergence rates in a worst-case setting.
Indeed, the result continues to hold even for the significantly stronger oracle that responds
with a random function that is a noisy realization of the true function. In this sense, our result
is close in spirit to a statistical sample complexity lower bound. Our proof technique is based
on constructing a "packing set" of functions, and thus has some similarity to techniques used
in statistical minimax analysis (e.g., [9, 10, 11, 12]) and learning theory (e.g., [13, 14, 15]). A
significant difference, as will be shown shortly, is that the metric of interest for optimization
is very different from those typically studied in statistical minimax theory.
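
The dimension dependence in examples (a) through (c) can be checked numerically with a short Python sketch. It assumes that the leading terms of Theorem 1 are $L r \sqrt{d/T}$ for $p \le 2$ and $L r d^{1-1/p}/\sqrt{T}$ for $p > 2$, and it sets the universal constant to 1; the function names are illustrative only.

```python
import math

def theorem1_rate(L, r, d, T, p):
    """Leading term of the Theorem 1 lower bound, with the universal constant set to 1."""
    if p <= 2:
        return L * r * math.sqrt(d / T)
    return L * r * d ** (1.0 - 1.0 / p) / math.sqrt(T)

def inscribed_radius(q, d):
    """Radius of the largest l_infinity ball inside the unit l_q ball: d**(-1/q)."""
    return 1.0 if q == math.inf else d ** (-1.0 / q)

L, T = 1.0, 10_000
for d in (10, 1000):
    rate_box = theorem1_rate(L, inscribed_radius(math.inf, d), d, T, p=2)   # S = B_inf(1)
    rate_ball = theorem1_rate(L, inscribed_radius(2, d), d, T, p=2)         # S = B_2(1)
    print(d, rate_box, rate_ball)
```

Under these assumptions the rate for the $\ell_\infty$ box grows like $\sqrt{d}$, while the rate for the $\ell_2$ ball stays at $L/\sqrt{T}$ regardless of $d$.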



3.2 Oracle complexity for strongly convex Lipschitz functions 

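We now turn to the statement of lower bounds over the class of Lipschitz and strongly convex
functions $\mathcal{F}_{scv}$ from Definition 3. In all these statements, we assume that $\gamma^2 \le 4 L d^{-1/p} / r$, as is
required for the definition of $\mathcal{F}_{scv}$ to be sensible.

Theorem 2. Let $\mathbb{S} = \mathbb{B}_\infty(r)$. Then there exist universal constants $c_1, c_2 > 0$ such that the minimax
oracle complexity over the class $\mathcal{F}_{scv}(\mathbb{S}, p; L, \gamma)$ satisfies the following lower bounds:

(a) For $p = 1$, we have

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{scv}, \phi) \ge \min\left\{ c_1 \frac{L^2}{\gamma^2 T}, \; c_2 L r \sqrt{\frac{d}{T}}, \; \frac{L^2}{1152 \gamma^2 d}, \; \frac{L r}{144} \right\}$. (12)

(b) For $p > 2$, we have:

$\sup_{\phi \in \mathbb{O}_{p,L}} \epsilon^*_T(\mathcal{F}_{scv}, \phi) \ge \min\left\{ c_1 \frac{L^2 d^{1 - 2/p}}{\gamma^2 T}, \; c_2 \frac{L r d^{1 - 1/p}}{\sqrt{T}}, \; \frac{L^2 d^{1 - 2/p}}{1152 \gamma^2}, \; \frac{L r d^{1 - 1/p}}{144} \right\}$. (13)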
As with Theorem 1, these lower bounds are sharp. In particular, for $\mathbb{S} = \mathbb{B}_\infty(1)$, stochastic
gradient descent achieves the rate (12) up to logarithmic factors [16], and closely related algorithms
proposed in very recent works [17, 18] match the lower bound exactly up to constant factors. It
should be noted that Theorem 2 exhibits an interesting phase transition between two regimes. On one
hand, suppose that the strong convexity parameter $\gamma^2$ is large: then as long as $T$ is sufficiently
large, the first term $\Omega(1/T)$ determines the minimax rate, which corresponds to the fast rate possible
under strong convexity. In contrast, if we consider a poorly conditioned objective with $\gamma \approx 0$, then
the term involving $\Omega(1/\sqrt{T})$ is dominant, corresponding to the rate for a convex objective. This
behavior is natural, since Theorem 2 recovers (as a special case) the convex result with $\gamma = 0$.
However, it should be noted that Theorem 2 applies only to the set $\mathbb{B}_\infty(r)$, and not to arbitrary
sets $\mathbb{S}$ like Theorem 1. Consequently, the generalization of Theorem 2 to arbitrary convex, compact
sets remains an interesting open question.
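
The transition between the two regimes can be illustrated with a small Python sketch. It assumes that, for $p = 1$, the two $T$-dependent terms of the bound have the forms $L^2/(\gamma^2 T)$ and $L r \sqrt{d/T}$, and it sets the constants $c_1$ and $c_2$ to 1, which is an arbitrary choice made only for illustration.

```python
import math

def theorem2_terms(L, gamma, r, d, T):
    """The two T-dependent terms for p = 1, with c1 = c2 = 1 (an assumption)."""
    fast = L ** 2 / (gamma ** 2 * T)     # strongly convex regime, order 1/T
    slow = L * r * math.sqrt(d / T)      # convex regime, order 1/sqrt(T)
    return fast, slow

def active_regime(L, gamma, r, d, T):
    """Report which term attains the minimum of the two."""
    fast, slow = theorem2_terms(L, gamma, r, d, T)
    return "1/T" if fast <= slow else "1/sqrt(T)"

print(active_regime(1.0, 1.0, 1.0, 4, 10_000))    # well-conditioned objective
print(active_regime(1.0, 0.01, 1.0, 4, 10_000))   # poorly conditioned objective
```

In this sketch a large $\gamma$ makes the $1/T$ term the smaller one, while $\gamma$ close to zero hands control to the $1/\sqrt{T}$ term, matching the discussion above.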

3.3 Oracle complexity for convex Lipschitz functions with sparse optima 

Finally, we turn to the oracle complexity of optimization over the class $\mathcal{F}_{sp}$ from Definition 4.

Theorem 3. Let $\mathcal{F}_{sp}$ be the class of all convex functions that are $L$-Lipschitz with respect to the
$\|\cdot\|_\infty$ norm and that have a $k$-sparse optimizer. Let $\mathbb{S} \subseteq \mathbb{R}^d$ be a convex set with $\mathbb{B}_\infty(r) \subseteq \mathbb{S}$. Then
there exists a universal constant $c_0 > 0$ such that for all $k \le \lfloor d/2 \rfloor$, we have

$\sup_{\phi \in \mathbb{O}_{\infty,L}} \epsilon^*_T(\mathcal{F}_{sp}; \phi) \ge \min\left\{ c_0 L r \sqrt{\frac{k^2 \log\frac{d}{k}}{T}}, \; \frac{L k r}{432} \right\}$. (14)

Remark: If $k = O(d^{1-\delta})$ for some $\delta \in (0, 1)$ (so that $\log\frac{d}{k} = \Theta(\log d)$), then this bound is sharp
up to constant factors. In particular, suppose that we use mirror descent based on the $\|\cdot\|_{1+\varepsilon}$ norm
with $\varepsilon = 2 \log d / (2 \log d - 1)$. As we discuss in more detail in Appendix C, it can be shown that this
technique will achieve a solution accurate to $O\left(\sqrt{\frac{k^2 \log d}{T}}\right)$ within $T$ iterations; this achievable result
matches our lower bound (14) up to constant factors under the assumed scaling $k = O(d^{1-\delta})$. To
the best of our knowledge, Theorem 3 provides the first tight lower bound on the oracle complexity
of sparse optimization.
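
To see how mild the dependence on the ambient dimension becomes under sparsity, the following Python sketch compares the leading term of the sparse bound with the corresponding dense rate. It assumes the sparse leading term is $L r \sqrt{k^2 \log(d/k)/T}$ and takes the dense comparison to be the $p \to \infty$ limit of the Theorem 1 rate, $L r d/\sqrt{T}$; all constants are set to 1 and the function names are hypothetical.

```python
import math

def sparse_rate(L, r, k, d, T):
    """Leading term of the sparse lower bound, with the universal constant set to 1."""
    return L * r * math.sqrt(k ** 2 * math.log(d / k) / T)

def dense_rate(L, r, d, T):
    """Leading term of the p > 2 bound of Theorem 1 in the limit p -> infinity."""
    return L * r * d / math.sqrt(T)

L, r, T = 1.0, 1.0, 10 ** 6
for d in (100, 10_000):
    print(d, sparse_rate(L, r, 5, d, T), dense_rate(L, r, d, T))
```

With $k = 5$ fixed, increasing $d$ by a factor of 100 changes the sparse term only through $\sqrt{\log(d/k)}$, whereas the dense term grows linearly in $d$.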



4 Proofs of results 

We now turn to the proofs of our main results. We begin in Section 4.1 by outlining the framework
and establishing some basic results on which our proofs are based. Sections 4.2 through 4.4 are
devoted to the proofs of Theorems 1 through 3 respectively.



4.1 Framework and basic results 

We begin by establishing a basic set of results that are exploited in the proofs of the main results. 
At a high-level, our main idea is to show that the problem of convex optimization is at least as 
hard as estimating the parameters of Bernoulli variables — that is, the biases of d independent coins. 
In order to perform this embedding, for a given error tolerance $\epsilon$, we start with an appropriately
chosen subset of the vertices of a $d$-dimensional hypercube, each of which corresponds to some
values of the $d$ Bernoulli parameters. For a given function class, we then construct a "difficult"
subclass of functions that are indexed by these vertices of the hypercube. We then show that being
able to optimize any function in this subclass to $\epsilon$-accuracy requires identifying the hypercube
vertex. This is a multiway hypothesis test based on the observations provided by $T$ queries to the
stochastic oracle, and we apply Fano's inequality [19] or Le Cam's bound [20, 12] to lower bound
the probability of error. In the remainder of this section, we provide more detail on each of the steps
involved in this embedding.

4.1.1 Constructing a difficult subclass of functions 

Our first step is to construct a subclass of functions $\mathcal{G} \subseteq \mathcal{F}$ that we use to derive lower bounds.
Any such subclass is parametrized by a subset $\mathcal{V} \subseteq \{-1, +1\}^d$ of the hypercube, chosen as follows.
Recalling that $\Delta_H$ denotes the Hamming metric, we let $\mathcal{V} = \{\alpha^1, \ldots, \alpha^M\}$ be a subset of the
vertices of the hypercube such that

$\Delta_H(\alpha^j, \alpha^k) \ge \frac{d}{4}$ for all $j \neq k$, (15)

meaning that $\mathcal{V}$ is a $\frac{d}{4}$-packing in the Hamming norm. It is a classical fact (e.g., [21]) that one can
construct such a set with cardinality $|\mathcal{V}| \ge (2/\sqrt{e})^{d/2}$.

Now let $\mathcal{G}_{base} = \{f_i^+, f_i^-, \; i = 1, \ldots, d\}$ denote some base set of $2d$ functions defined on the
convex set $\mathbb{S}$, to be chosen appropriately depending on the problem at hand. For a given tolerance
$\delta \in (0, \frac{1}{4}]$, we define, for each vertex $\alpha \in \mathcal{V}$, the function

$g_\alpha(x) := \frac{c}{d} \sum_{i=1}^d \left\{ (1/2 + \alpha_i \delta) f_i^+(x) + (1/2 - \alpha_i \delta) f_i^-(x) \right\}$. (16)

Depending on the result to be proven, our choice of the base functions $\{f_i^+, f_i^-\}$ and the pre-factor
$c$ will ensure that each $g_\alpha$ satisfies the appropriate Lipschitz and/or strong convexity properties
over $\mathbb{S}$. Moreover, we will ensure that all minimizers $x_\alpha$ of each $g_\alpha$ are contained within $\mathbb{S}$.
Based on these functions and the packing set $\mathcal{V}$, we define the function class

$\mathcal{G}(\delta) := \{g_\alpha, \; \alpha \in \mathcal{V}\}$. (17)

Note that $\mathcal{G}(\delta)$ contains a total of $|\mathcal{V}|$ functions by construction, and as mentioned previously, our
choices of the base functions etc. will ensure that $\mathcal{G}(\delta) \subseteq \mathcal{F}$. We demonstrate specific choices of
the class $\mathcal{G}(\delta)$ in the proofs of Theorems 1 through 3 to follow.
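
The construction can be made concrete for a small dimension with the Python sketch below. Two pieces are assumptions made purely for illustration: the packing is built by a brute-force greedy search (feasible only for small $d$), and the base functions are taken to be $f_i^+(x) = |x_i + r|$ and $f_i^-(x) = |x_i - r|$, a choice that this section leaves unspecified.

```python
import itertools

def hamming(a, b):
    """Hamming distance between two sign vectors."""
    return sum(1 for u, v in zip(a, b) if u != v)

def greedy_packing(d):
    """Greedily collect vertices of {-1,+1}^d with pairwise Hamming distance >= d/4.
    Enumerates all 2**d vertices, so it is only sensible for small d."""
    packing = []
    for v in itertools.product((-1, 1), repeat=d):
        if all(hamming(v, w) >= d / 4 for w in packing):
            packing.append(v)
    return packing

def make_g(alpha, delta, c=1.0, r=0.5):
    """g_alpha from (16) with the assumed base pair |x_i + r| and |x_i - r|."""
    d = len(alpha)
    def g(x):
        total = 0.0
        for a_i, x_i in zip(alpha, x):
            total += (0.5 + a_i * delta) * abs(x_i + r) + (0.5 - a_i * delta) * abs(x_i - r)
        return c * total / d
    return g

V = greedy_packing(8)
g = make_g(V[0], delta=0.1)
```

For this assumed base pair, each coordinate of the minimizer of $g_\alpha$ sits at $-r\alpha_i$, so the vertex $\alpha$ can be read off from the signs of the minimizer.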

4.1.2 Optimizing well is equivalent to function identification 

We now claim that if a method can optimize over the subclass $\mathcal{G}(\delta)$ up to a certain tolerance, then
it must be capable of identifying which function $g_\alpha \in \mathcal{G}(\delta)$ was chosen. We first require a measure
for the closeness of functions in terms of their behavior near each others' minima. Recall that we
use $x^*_f \in \mathbb{R}^d$ to denote a minimizing point of the function $f$. Given a convex set $\mathbb{S} \subseteq \mathbb{R}^d$ and two
functions $f, g$, we define

$\rho(f, g) := \inf_{x \in \mathbb{S}} \left[ f(x) + g(x) - f(x^*_f) - g(x^*_g) \right]$. (18)
Figure 1. Illustration of the discrepancy function $\rho(f, g)$. The functions $f$ and $g$ achieve their
minimum values $f(x^*_f)$ and $g(x^*_g)$ at the points $x^*_f$ and $x^*_g$ respectively.

This discrepancy measure is non-negative, symmetric in its arguments, and satisfies $\rho(f, g) = 0$
if and only if $x^*_f = x^*_g$, so that we may refer to it as a premetric. (It does not satisfy the triangle
inequality nor the condition that $\rho(f, g) = 0$ if and only if $f = g$, both of which are required for $\rho$
to be a metric.)

Given the subclass $\mathcal{G}(\delta)$, we quantify how densely it is packed with respect to the premetric $\rho$
using the quantity

$\psi(\mathcal{G}(\delta)) := \min_{\alpha \neq \beta \in \mathcal{V}} \rho(g_\alpha, g_\beta)$. (19)

We denote this quantity by $\psi(\delta)$ when the class $\mathcal{G}$ is clear from the context. We now state a simple
result that demonstrates the utility of maintaining a separation under $\rho$ among functions in $\mathcal{G}(\delta)$.



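Lemma 1. For any $\tilde{x} \in \mathbb{S}$, there can be at most one function $g_\alpha \in \mathcal{G}(\delta)$ such that

$g_\alpha(\tilde{x}) - \inf_{x \in \mathbb{S}} g_\alpha(x) \le \frac{\psi(\delta)}{3}$. (20)

Thus, if we have an element $\tilde{x} \in \mathbb{S}$ that approximately minimizes one function in the set $\mathcal{G}(\delta)$ up
to tolerance $\psi(\delta)/3$, then it cannot approximately minimize any other function in the set.

Proof. For a given $\tilde{x} \in \mathbb{S}$, suppose that there exists an $\alpha \in \mathcal{V}$ such that $g_\alpha(\tilde{x}) - g_\alpha(x^*_\alpha) \le \frac{\psi(\delta)}{3}$.
From the definition of $\psi(\delta)$ in (19), for any $\beta \in \mathcal{V}$, $\beta \neq \alpha$, we have

$\psi(\delta) \le g_\alpha(\tilde{x}) - \inf_{x \in \mathbb{S}} g_\alpha(x) + g_\beta(\tilde{x}) - \inf_{x \in \mathbb{S}} g_\beta(x) \le \frac{\psi(\delta)}{3} + g_\beta(\tilde{x}) - \inf_{x \in \mathbb{S}} g_\beta(x)$.

Re-arranging yields the inequality $g_\beta(\tilde{x}) - g_\beta(x^*_\beta) \ge \frac{2}{3} \psi(\delta)$, from which the claim (20) follows.

□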
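
Lemma 1 can be checked numerically on a toy example. The Python sketch below approximates the infimum in (18) by a minimum over a finite grid, and uses the hypothetical pair $f(x) = |x - 1|$, $g(x) = |x + 1|$ on $\mathbb{S} = [-2, 2]$; none of these choices come from the paper.

```python
def discrepancy(f, g, grid):
    """rho(f, g) from (18), with the infimum over S replaced by a minimum over a grid."""
    f_min = min(f(x) for x in grid)
    g_min = min(g(x) for x in grid)
    return min(f(x) + g(x) - f_min - g_min for x in grid)

grid = [i / 100.0 for i in range(-200, 201)]   # S = [-2, 2], discretized
f = lambda x: abs(x - 1.0)
g = lambda x: abs(x + 1.0)

rho = discrepancy(f, g, grid)                  # approximately 2 for this pair
near_f = {x for x in grid if f(x) <= rho / 3}  # points that nearly minimize f
near_g = {x for x in grid if g(x) <= rho / 3}  # points that nearly minimize g
overlap = near_f & near_g                      # empty, in line with Lemma 1
```

Since the class here contains only two functions, $\psi$ coincides with $\rho(f, g)$, and the empty overlap says that no grid point is within $\psi/3$ of optimal for both functions at once.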

Suppose that for some fixed but unknown function $g_{\alpha^*} \in \mathcal{G}(\delta)$, some method $\mathcal{M}_T$ is allowed to
make $T$ queries to an oracle with information function $\phi(\cdot\,; g_{\alpha^*})$, thereby obtaining the information
sequence

$\phi(x_1^T; g_{\alpha^*}) := \{\phi(x_t; g_{\alpha^*}), \; t = 1, 2, \ldots, T\}$.

Our next lemma shows that if the method $\mathcal{M}_T$ achieves a low minimax error over the class $\mathcal{G}(\delta)$,
then one can use its output to construct a hypothesis test that returns the true parameter $\alpha^*$
at least 2/3 of the time. (In this statement, we recall the definition (2) of the minimax error in
optimization.)

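Lemma 2. Suppose that based on the data $\phi(x_1^T; g_{\alpha^*})$, there exists a method $\mathcal{M}_T$ that achieves a
minimax error satisfying

$$\mathbb{E}\big[\epsilon_T(\mathcal{M}_T, \mathcal{G}(\delta), \mathbb{S}, \phi)\big] \le \frac{\psi(\delta)}{9}. \tag{21}$$

Based on such a method $\mathcal{M}_T$, one can construct a hypothesis test $\hat{\alpha} : \phi(x_1^T; g_{\alpha^*}) \to \mathcal{V}$ such that
$\max_{\alpha^* \in \mathcal{V}} \mathbb{P}_\phi[\hat{\alpha} \neq \alpha^*] \le \frac{1}{3}$.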

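Proof. Given a method $\mathcal{M}_T$ that satisfies the bound (21), we construct an estimator $\hat{\alpha}(\mathcal{M}_T)$ of the
true vertex $\alpha^*$ as follows. If there exists some $\alpha \in \mathcal{V}$ such that $g_\alpha(x_T) - g_\alpha(x_\alpha) \le \psi(\delta)/3$, then we
set $\hat{\alpha}(\mathcal{M}_T)$ equal to $\alpha$. If no such $\alpha$ exists, then we choose $\hat{\alpha}(\mathcal{M}_T)$ uniformly at random from $\mathcal{V}$.
From Lemma 1, there can exist only one such $\alpha \in \mathcal{V}$ that satisfies this inequality. Consequently,
using Markov's inequality, we have
$\mathbb{P}_\phi[\hat{\alpha}(\mathcal{M}_T) \neq \alpha^*] \le \mathbb{P}_\phi[\epsilon_T(\mathcal{M}_T, g_{\alpha^*}, \mathbb{S}, \phi) \ge \psi(\delta)/3] \le \frac{1}{3}$.
Maximizing over $\alpha^*$ completes the proof. □

We have thus shown that having a low minimax optimization error over $\mathcal{G}(\delta)$ implies that the vertex
$\alpha^* \in \mathcal{V}$ can be identified most of the time.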
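The test constructed in the proof of Lemma 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`approximately_minimized`, `identify_vertex`) and the toy one-dimensional family are ours, not part of the paper. The toy family uses the absolute-value base functions $|x + 1/2|$ and $|x - 1/2|$ with $d = 1$ and $c = 1$, for which each minimum equals $1/2 - \delta$ and the separation is $\psi = 2\delta$.

```python
import random

def approximately_minimized(x, family, psi):
    """Labels alpha whose optimality gap at x is at most psi / 3 (condition (20))."""
    return [a for a, (g, g_min) in family.items() if g(x) - g_min <= psi / 3]

def identify_vertex(x, family, psi, rng=random):
    """Hypothesis test from the proof of Lemma 2 (illustrative helper).

    family maps each label alpha to a pair (g_alpha, inf g_alpha).
    Lemma 1 guarantees at most one label passes the psi/3 test when psi
    is a valid lower bound on the pairwise discrepancy.
    """
    hits = approximately_minimized(x, family, psi)
    if len(hits) > 1:
        raise ValueError("psi is larger than the true separation of the family")
    # Fall back to a uniformly random guess when no function is nearly minimized.
    return hits[0] if hits else rng.choice(list(family))

# Toy family (assumption for illustration): d = 1, c = 1, alpha in {+1, -1}.
delta = 0.1

def make_g(a):
    return lambda x: (0.5 + a * delta) * abs(x + 0.5) + (0.5 - a * delta) * abs(x - 0.5)

family = {a: (make_g(a), 0.5 - delta) for a in (+1, -1)}
psi = 2 * delta  # discrepancy between the two toy functions
```

For instance, the point $x = -1/2$ minimizes the function indexed by $\alpha = +1$, so the test returns $+1$ there, whereas $x = 0$ is not within tolerance $\psi/3$ of either minimum and triggers the random fallback.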

4.1.3 Oracle answers and coin tosses 

We now describe stochastic first order oracles $\phi$ for which the samples $\phi(x_1^T; g_\alpha)$ can be related to
coin tosses. In particular, we associate a coin with each dimension $i \in \{1, 2, \ldots, d\}$, and consider
the set of coin bias vectors lying in the set

$$\Theta(\delta) = \{(1/2 + \alpha_1\delta, \ldots, 1/2 + \alpha_d\delta) \mid \alpha \in \mathcal{V}\}. \tag{22}$$

Given a particular function $g_\alpha \in \mathcal{G}(\delta)$ (or equivalently, a vertex $\alpha \in \mathcal{V}$), we consider two different
types of stochastic first-order oracles $\phi$, defined as follows:



Oracle A: 1-dimensional unbiased gradients

(a) Pick an index $i \in \{1, \ldots, d\}$ uniformly at random.

(b) Draw $b_i \in \{0, 1\}$ according to a Bernoulli distribution with parameter $1/2 + \alpha_i\delta$.

(c) For the given input $x \in \mathbb{S}$, return the value $\hat{g}_{\alpha,A}(x)$ and a sub-gradient
$\hat{z}_{\alpha,A}(x) \in \partial \hat{g}_{\alpha,A}(x)$ of the function

$$\hat{g}_{\alpha,A} := c\,\big[b_i f_i^+ + (1 - b_i) f_i^-\big].$$
By construction, the function value and gradients returned by Oracle A are unbiased estimates
of those of $g_\alpha$. In particular, since each co-ordinate $i$ is chosen with probability $1/d$, we have

$$\mathbb{E}\big[\hat{g}_{\alpha,A}(x)\big] = \frac{c}{d} \sum_{i=1}^d \big[\mathbb{E}[b_i]\, f_i^+(x) + \mathbb{E}[1 - b_i]\, f_i^-(x)\big] = g_\alpha(x),$$

with a similar relation for the gradient. Furthermore, as long as the base functions $f_i^+$ and $f_i^-$ have
gradients bounded by 1, we have $\mathbb{E}[\|\hat{z}_{\alpha,A}(x)\|_p] \le c$ for all $p \in [1, \infty]$.

Parts of our proofs are based on an oracle that responds with function values and gradients that
are $d$-dimensional in nature.



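Oracle B: d-dimensional unbiased gradients

(a) For $i = 1, \ldots, d$, draw $b_i \in \{0, 1\}$ according to a Bernoulli distribution with
parameter $1/2 + \alpha_i\delta$.

(b) For the given input $x \in \mathbb{S}$, return the value $\hat{g}_{\alpha,B}(x)$ and a sub-gradient
$\hat{z}_{\alpha,B}(x) \in \partial \hat{g}_{\alpha,B}(x)$ of the function

$$\hat{g}_{\alpha,B} := \frac{c}{d} \sum_{i=1}^d \big[b_i f_i^+ + (1 - b_i) f_i^-\big].$$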



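As with Oracle A, this oracle returns unbiased estimates of the function values and gradients.
We frequently work with functions $f_i^+, f_i^-$ that depend only on the $i$th coordinate $x(i)$. In such
cases, under the assumptions $\big|\frac{\partial f_i^+}{\partial x(i)}\big| \le 1$ and $\big|\frac{\partial f_i^-}{\partial x(i)}\big| \le 1$, we have

$$\|\hat{z}_{\alpha,B}(x)\|_p^2 = \frac{c^2}{d^2} \left( \sum_{i=1}^d \left| b_i \frac{\partial f_i^+(x)}{\partial x(i)} + (1 - b_i) \frac{\partial f_i^-(x)}{\partial x(i)} \right|^p \right)^{2/p} \le c^2 d^{2/p - 2}. \tag{23}$$

In our later uses of Oracles A and B, we choose the pre-factor $c$ appropriately so as to produce the
desired Lipschitz constants.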
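To make the two oracles concrete, the following Python sketch simulates one round of each. It assumes, purely for illustration, the absolute-value base functions $f_i^+(x) = |x(i) + 1/2|$ and $f_i^-(x) = |x(i) - 1/2|$ that are introduced later in equation (28); all function names are ours. Averaging many oracle values should approach $g_\alpha(x)$, which is the unbiasedness property stated above.

```python
import random

def sign(u):
    # A valid subgradient of |u|: +1, -1, or 0 at the kink.
    return (u > 0) - (u < 0)

def g_alpha(x, alpha, delta, c):
    """Exact objective built from f_i^+ = |x_i + 1/2| and f_i^- = |x_i - 1/2|."""
    d = len(alpha)
    return (c / d) * sum(
        (0.5 + a * delta) * abs(xi + 0.5) + (0.5 - a * delta) * abs(xi - 0.5)
        for a, xi in zip(alpha, x)
    )

def oracle_a(x, alpha, delta, c, rng):
    """One round of Oracle A: one random coordinate, one coin flip."""
    d = len(alpha)
    i = rng.randrange(d)                              # step (a)
    b = rng.random() < 0.5 + alpha[i] * delta         # step (b)
    u = x[i] + 0.5 if b else x[i] - 0.5               # argument of f_i^+ or f_i^-
    grad = [0.0] * d
    grad[i] = c * sign(u)                             # step (c): 1-sparse subgradient
    return c * abs(u), grad

def oracle_b(x, alpha, delta, c, rng):
    """One round of Oracle B: all d coins are flipped."""
    d = len(alpha)
    value, grad = 0.0, []
    for a, xi in zip(alpha, x):
        b = rng.random() < 0.5 + a * delta
        u = xi + 0.5 if b else xi - 0.5
        value += (c / d) * abs(u)
        grad.append((c / d) * sign(u))
    return value, grad

def monte_carlo_mean(oracle, x, alpha, delta, c, rounds, seed=0):
    """Average of the returned function values over independent rounds."""
    rng = random.Random(seed)
    return sum(oracle(x, alpha, delta, c, rng)[0] for _ in range(rounds)) / rounds
```

Note the structural difference: Oracle A returns a gradient with at most one non-zero entry of magnitude $c$, while Oracle B returns a dense gradient whose entries have magnitude at most $c/d$.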



4.1.4 Lower bounds on coin-tossing 

Finally, we use information-theoretic methods to lower bound the probability of error in estimating
the true parameter $\alpha^* \in \mathcal{V}$ in our model. At each round of either Oracle A or Oracle B, we can
consider a set of $d$ coin tosses, with an associated vector $\theta^* = (\frac{1}{2} + \alpha^*_1\delta, \ldots, \frac{1}{2} + \alpha^*_d\delta)$ of parameters.
At any round, the output of Oracle A can (at most) reveal the instantiation $b_i \in \{0, 1\}$ of a randomly
chosen index, whereas Oracle B can at most reveal the entire vector $(b_1, b_2, \ldots, b_d)$. Our goal is to
lower bound the probability of incorrectly estimating the true parameter $\alpha^*$, based on a sequence of length
$T$. As noted previously in remarks following Theorem 1, this part of our proof exploits classical
techniques from statistical minimax theory, including the use of Fano's inequality (e.g., [9, 10, 11, 12])
and Le Cam's bound (e.g., [20, 12]).
Lemma 3. Suppose that the Bernoulli parameter vector $\alpha^*$ is chosen uniformly at random from
the packing set $\mathcal{V}$, and suppose that the outcome of $\ell \le d$ coins chosen uniformly at random is
revealed at each round $t = 1, \ldots, T$. Then for any $\delta \in (0, 1/4]$, any hypothesis test $\hat{\alpha}$ satisfies

$$\mathbb{P}[\hat{\alpha} \neq \alpha^*] \ge 1 - \frac{16\,\ell\,T\,\delta^2 + \log 2}{\frac{d}{2} \log(2/\sqrt{e})}, \tag{24}$$

where the probability is taken over both randomness in the oracle and the choice of $\alpha^*$.

Note that we will apply the lower bound (24) with $\ell = 1$ in the case of Oracle A, and $\ell = d$ in the
case of Oracle B.
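As a quick numerical illustration of the bound (24), the following Python sketch evaluates its right-hand side. The function and argument names are ours and purely illustrative; the only inputs are the quantities $T$, $d$, $\delta$ and $\ell$ appearing in the lemma.

```python
import math

def fano_error_lower_bound(T, d, delta, ell):
    """Right-hand side of (24): a lower bound on P[alpha_hat != alpha*].

    T     -- number of oracle rounds
    d     -- dimension (number of coins)
    delta -- coin bias offset, must lie in (0, 1/4]
    ell   -- coins revealed per round (1 for Oracle A, d for Oracle B)
    """
    if not 0 < delta <= 0.25:
        raise ValueError("delta must lie in (0, 1/4]")
    if not 1 <= ell <= d:
        raise ValueError("ell must satisfy 1 <= ell <= d")
    log_packing = (d / 2) * math.log(2 / math.sqrt(math.e))  # lower bound on log|V|
    return 1 - (16 * ell * T * delta ** 2 + math.log(2)) / log_packing
```

The bound weakens linearly in $T$, and it weakens $d$ times faster for Oracle B ($\ell = d$) than for Oracle A ($\ell = 1$), which reflects the extra information revealed per round. A negative return value simply means that the bound is vacuous for those parameters.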

Proof. For each time $t = 1, 2, \ldots, T$, let $U_t$ denote the randomly chosen subset of size $\ell$, let $X_{t,i}$ be the
outcome of the oracle's coin toss at time $t$ for coordinate $i$, and let $Y_t \in \{-1, 0, 1\}^d$ be a random vector
with entries

$$Y_{t,i} = \begin{cases} X_{t,i} & \text{if } i \in U_t, \text{ and} \\ -1 & \text{if } i \notin U_t. \end{cases}$$

By Fano's inequality [19], we have the lower bound

$$\mathbb{P}[\hat{\alpha} \neq \alpha^*] \ge 1 - \frac{I(\{(U_t, Y_t)\}_{t=1}^T; \alpha^*) + \log 2}{\log |\mathcal{V}|},$$

where $I(\{(U_t, Y_t)\}_{t=1}^T; \alpha^*)$ denotes the mutual information between the sequence $\{(U_t, Y_t)\}_{t=1}^T$ and
the random parameter vector $\alpha^*$. As discussed earlier, we are guaranteed that $\log |\mathcal{V}| \ge \frac{d}{2} \log(2/\sqrt{e})$.
Consequently, in order to prove the lower bound (24), it suffices to establish the upper bound
$I(\{(U_t, Y_t)\}_{t=1}^T; \alpha^*) \le 16\,T\,\ell\,\delta^2$.

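By the independent and identically distributed nature of the sampling model, we have

$$I\big(((U_1, Y_1), \ldots, (U_T, Y_T)); \alpha^*\big) = \sum_{t=1}^T I((U_t, Y_t); \alpha^*) = T\, I((U_1, Y_1); \alpha^*),$$

so that it suffices to upper bound the mutual information for a single round. To simplify notation,
from here onwards we write $(Y, U)$ to mean the pair $(Y_1, U_1)$. With this notation, the remainder of
our proof is devoted to establishing that $I((U, Y); \alpha^*) \le 16\,\ell\,\delta^2$.

By the chain rule for mutual information [19], we have

$$I((U, Y); \alpha^*) = I(Y; \alpha^* \mid U) + I(\alpha^*; U). \tag{25}$$

Since the subset $U$ is chosen independently of $\alpha^*$, we have $I(\alpha^*; U) = 0$, and so it suffices to upper
bound the first term. By definition of conditional mutual information [19], we have

$$I(Y; \alpha^* \mid U) = \mathbb{E}_U\, \mathbb{E}_{\alpha^*}\big[ D(\mathbb{P}_{Y \mid \alpha^*, U} \,\|\, \mathbb{P}_{Y \mid U}) \big].$$

Since $\alpha$ has a uniform distribution over $\mathcal{V}$, we have $\mathbb{P}_{Y \mid U} = \frac{1}{|\mathcal{V}|} \sum_{\alpha \in \mathcal{V}} \mathbb{P}_{Y \mid \alpha, U}$, and convexity of the
Kullback-Leibler (KL) divergence yields the upper bound

$$D(\mathbb{P}_{Y \mid \alpha^*, U} \,\|\, \mathbb{P}_{Y \mid U}) \le \frac{1}{|\mathcal{V}|} \sum_{\alpha \in \mathcal{V}} D(\mathbb{P}_{Y \mid \alpha^*, U} \,\|\, \mathbb{P}_{Y \mid \alpha, U}). \tag{26}$$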
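Now for any pair $\alpha^*, \alpha \in \mathcal{V}$, the KL divergence $D(\mathbb{P}_{Y \mid \alpha^*, U} \,\|\, \mathbb{P}_{Y \mid \alpha, U})$ can be at most the KL
divergence between $\ell$ independent pairs of Bernoulli variates with parameters $\frac{1}{2} + \delta$ and $\frac{1}{2} - \delta$.
Letting $D(\delta)$ denote the Kullback-Leibler divergence between a single pair of Bernoulli variables
with parameters $\frac{1}{2} + \delta$ and $\frac{1}{2} - \delta$, a little calculation yields

$$D(\delta) = \Big(\frac{1}{2} + \delta\Big) \log \frac{\frac{1}{2} + \delta}{\frac{1}{2} - \delta} + \Big(\frac{1}{2} - \delta\Big) \log \frac{\frac{1}{2} - \delta}{\frac{1}{2} + \delta} = 2\delta \log\Big(1 + \frac{4\delta}{1 - 2\delta}\Big) \le \frac{8\delta^2}{1 - 2\delta}.$$

Consequently, as long as $\delta \le 1/4$, we have $D(\delta) \le 16\delta^2$. Returning to the bound (26), we conclude
that $D(\mathbb{P}_{Y \mid \alpha^*, U} \,\|\, \mathbb{P}_{Y \mid U}) \le 16\,\ell\,\delta^2$. Taking averages over $U$, we obtain the bound $I(Y; \alpha^* \mid U) \le 16\,\ell\,\delta^2$,
and applying the decomposition (25) yields $I((U, Y); \alpha^*) \le 16\,\ell\,\delta^2$, thereby completing the proof. □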
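The "little calculation" for $D(\delta)$ is easy to check numerically. The sketch below (function names are ours) compares the definition of the Bernoulli KL divergence with the closed form $2\delta \log(1 + 4\delta/(1 - 2\delta))$ and with the two upper bounds $8\delta^2/(1 - 2\delta)$ and $16\delta^2$ used in the proof.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence D(Ber(p) || Ber(q)) in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_coin_pair(delta):
    """D(delta): KL divergence between Ber(1/2 + delta) and Ber(1/2 - delta)."""
    return bernoulli_kl(0.5 + delta, 0.5 - delta)

def kl_closed_form(delta):
    """Closed form 2 * delta * log(1 + 4 * delta / (1 - 2 * delta))."""
    return 2 * delta * math.log(1 + 4 * delta / (1 - 2 * delta))

def check_bounds(delta, tol=1e-12):
    """True when D(delta) matches the closed form and obeys both upper bounds."""
    d_val = kl_coin_pair(delta)
    return (
        abs(d_val - kl_closed_form(delta)) < tol
        and d_val <= 8 * delta ** 2 / (1 - 2 * delta) + tol
        and d_val <= 16 * delta ** 2 + tol
    )
```

The intermediate bound follows from $\log(1 + u) \le u$, and the final bound $16\delta^2$ is where the restriction $\delta \le 1/4$ enters, since it requires $1/(1 - 2\delta) \le 2$.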

The reader might have observed that Fano's inequality yields a non-trivial lower bound only
when $|\mathcal{V}|$ is large enough. Since $|\mathcal{V}|$ depends on the dimension $d$ for our construction, we can
apply the Fano lower bound only for $d$ large enough. Smaller values of $d$ can be handled
by reduction to the case $d = 1$; here we state a simple lower bound for estimating the bias of a
single coin, which is a straightforward application of Le Cam's bounding technique [20, 12]. In this
special case, we have $\mathcal{V} = \{1/2 + \delta, 1/2 - \delta\}$, and we recall that the estimator $\hat{\alpha}(\mathcal{M}_T)$ takes values
in $\mathcal{V}$.

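Lemma 4. Given a sample size $T \ge 1$ and a parameter $\alpha^* \in \mathcal{V}$, let $\{X_1, \ldots, X_T\}$ be $T$ i.i.d.
Bernoulli variables with parameter $\alpha^*$. Let $\hat{\alpha}$ be any test function based on these samples and
returning an element of $\mathcal{V}$. Then for any $\delta \in (0, 1/4]$, we have the lower bound

$$\sup_{\alpha^* \in \{\frac{1}{2} + \delta,\, \frac{1}{2} - \delta\}} \mathbb{P}_{\alpha^*}[\hat{\alpha} \neq \alpha^*] \ge 1 - \sqrt{8\,T\,\delta^2}.$$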

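Proof. We observe first that for $\hat{\alpha} \in \mathcal{V}$, $\mathbb{E}_{\alpha^*}[|\hat{\alpha} - \alpha^*|] = 2\delta\, \mathbb{P}_{\alpha^*}[\hat{\alpha} \neq \alpha^*]$, so that it suffices to lower
bound the expected error. To ease notation, let $\mathbb{Q}_1$ and $\mathbb{Q}_{-1}$ denote the probability distributions
indexed by $\alpha = \frac{1}{2} + \delta$ and $\alpha = \frac{1}{2} - \delta$ respectively. By Lemma 1 of Yu [12], we have

$$\sup_{\alpha^* \in \mathcal{V}} \mathbb{E}_{\alpha^*}[|\hat{\alpha} - \alpha^*|] \ge 2\delta \Big\{ 1 - \frac{\|\mathbb{Q}_1 - \mathbb{Q}_{-1}\|_1}{2} \Big\},$$

where we use the fact that $|(1/2 + \delta) - (1/2 - \delta)| = 2\delta$. Thus, we need to upper bound the total
variation distance $\|\mathbb{Q}_1 - \mathbb{Q}_{-1}\|_1$. From Pinsker's inequality [19], we have

$$\|\mathbb{Q}_1 - \mathbb{Q}_{-1}\|_1 \le \sqrt{2 D(\mathbb{Q}_1 \,\|\, \mathbb{Q}_{-1})} \overset{(i)}{\le} \sqrt{32\,T\,\delta^2},$$

where inequality (i) follows from the calculation following Equation (26) (see proof of Lemma 3), and
uses our assumption that $\delta \in (0, 1/4]$. Putting together the pieces, we obtain a lower bound on the
probability of error

$$\sup_{\alpha^* \in \mathcal{V}} \mathbb{P}[\hat{\alpha} \neq \alpha^*] = \sup_{\alpha^* \in \mathcal{V}} \frac{\mathbb{E}|\hat{\alpha} - \alpha^*|}{2\delta} \ge 1 - \sqrt{8\,T\,\delta^2},$$

as claimed. □

Equipped with these tools, we are now prepared to prove our main results.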



4.2 Proof of Theorem 1

We begin with oracle complexity for bounded Lipschitz functions, as stated in Theorem 1. We first
prove the result for the set $\mathbb{S} = \mathbb{B}_\infty(\frac{1}{2})$.

Part (a): Proof for $p \in [1, 2]$. Consider Oracle A that returns the quantities $(\hat{g}_{\alpha,A}(x), \hat{z}_{\alpha,A}(x))$.
By definition of the oracle, each round reveals at most one coin flip, meaning that we can
apply Lemma 3 with $\ell = 1$, thereby obtaining the lower bound

$$\mathbb{P}[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \ge 1 - 2\, \frac{16\,T\,\delta^2 + \log 2}{d \log(2/\sqrt{e})}. \tag{27}$$


We now seek an upper bound on $\mathbb{P}[\hat{\alpha}(\mathcal{M}_T) \neq \alpha]$ using Lemma 2. In order to do so, we need to
specify the base functions $(f_i^+, f_i^-)$ involved. For $i = 1, \ldots, d$, we define

$$f_i^+(x) := \Big| x(i) + \frac{1}{2} \Big|, \quad \text{and} \quad f_i^-(x) := \Big| x(i) - \frac{1}{2} \Big|. \tag{28}$$

Given that $\mathbb{S} = \mathbb{B}_\infty(\frac{1}{2})$, we see that the minimizers of $g_\alpha$ are contained in $\mathbb{S}$. Also, both the
functions are 1-Lipschitz in the $\ell_1$-norm. By the construction (16), we are guaranteed that for any
subgradient of $g_\alpha$, we have

$$\|\hat{z}_\alpha(x)\|_p \le 2c \quad \text{for all } p \ge 1.$$

Therefore, in order to ensure that $g_\alpha$ is $L$-Lipschitz in the dual $\ell_q$-norm, it suffices to set $c = L/2$.

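Let us now lower bound the discrepancy function (18). We first observe that each function $g_\alpha$
is minimized over the set $\mathbb{B}_\infty(\frac{1}{2})$ at the vector $x_\alpha := -\alpha/2$, at which point it achieves its minimum
value

$$\min_{x \in \mathbb{B}_\infty(\frac{1}{2})} g_\alpha(x) = \frac{c}{2} - c\delta.$$

Furthermore, we note that for any $\alpha \neq \beta$ we have

$$\begin{aligned} g_\alpha(x) + g_\beta(x) &= \frac{c}{d} \sum_{i=1}^d \Big[ (1 + \alpha_i\delta + \beta_i\delta)\, f_i^+(x) + (1 - \alpha_i\delta - \beta_i\delta)\, f_i^-(x) \Big] \\ &= \frac{c}{d} \sum_{i=1}^d \Big[ \big(f_i^+(x) + f_i^-(x)\big)\, \mathbb{I}(\alpha_i \neq \beta_i) + \big((1 + 2\alpha_i\delta) f_i^+(x) + (1 - 2\alpha_i\delta) f_i^-(x)\big)\, \mathbb{I}(\alpha_i = \beta_i) \Big]. \end{aligned}$$

When $\alpha_i = \beta_i$ then $x_\alpha(i) = x_\beta(i) = -\alpha_i/2$, so that this co-ordinate does not make a contribution
to the discrepancy function $\rho(g_\alpha, g_\beta)$. On the other hand, when $\alpha_i \neq \beta_i$, we have

$$f_i^+(x) + f_i^-(x) = \Big| x(i) + \frac{1}{2} \Big| + \Big| x(i) - \frac{1}{2} \Big| \ge 1 \quad \text{for all } x(i) \in \mathbb{R}.$$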
Consequently, any such co-ordinate yields a contribution of $2c\delta/d$ to the discrepancy. Recalling our
packing set (15) with $d/4$ separation in Hamming norm, we conclude that for any distinct $\alpha \neq \beta$
within our packing set,

$$\rho(g_\alpha, g_\beta) = \frac{2c\delta}{d}\, \Delta_H(\alpha, \beta) \ge \frac{c\delta}{2},$$

so that by definition of $\psi$, we have established the lower bound $\psi(\delta) \ge \frac{c\delta}{2}$.
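The identity $\rho(g_\alpha, g_\beta) = \frac{2c\delta}{d} \Delta_H(\alpha, \beta)$ can be verified numerically for small examples. The sketch below (names are ours) relies on one observation: on $[-1/2, 1/2]$ each coordinate term of $g_\alpha$ equals $1/2 + 2\alpha_i \delta\, x(i)$, which is linear, so every coordinate-wise infimum over the box is attained at an endpoint.

```python
def coord_term(a, delta, x):
    """Single-coordinate term (1/2 + a*delta)|x + 1/2| + (1/2 - a*delta)|x - 1/2|."""
    return (0.5 + a * delta) * abs(x + 0.5) + (0.5 - a * delta) * abs(x - 0.5)

def discrepancy(alpha, beta, delta, c=1.0):
    """rho(g_alpha, g_beta) = inf(g_alpha + g_beta) - inf g_alpha - inf g_beta
    over the box [-1/2, 1/2]^d, computed coordinate by coordinate."""
    d = len(alpha)
    endpoints = (-0.5, 0.5)  # the terms are linear on the box, so endpoints suffice
    joint = sum(
        min(coord_term(a, delta, x) + coord_term(b, delta, x) for x in endpoints)
        for a, b in zip(alpha, beta)
    )
    single_a = sum(min(coord_term(a, delta, x) for x in endpoints) for a in alpha)
    single_b = sum(min(coord_term(b, delta, x) for x in endpoints) for b in beta)
    return (c / d) * (joint - single_a - single_b)

def hamming(alpha, beta):
    """Hamming distance Delta_H between two sign vectors."""
    return sum(a != b for a, b in zip(alpha, beta))
```

For example, with $d = 4$, $c = 2$, $\delta = 0.1$ and two sign vectors at Hamming distance 2, both the direct computation and the closed form give $0.2$.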

Setting the target error $\epsilon := \frac{c\delta}{18}$, we observe that this choice ensures that $\epsilon \le \frac{\psi(\delta)}{9}$. Recalling
the requirement $\delta \le 1/4$, we have $\epsilon \le c/72$. In this regime, we may apply Lemma 2 to obtain the
upper bound $\mathbb{P}_\phi[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \le \frac{1}{3}$. Combining this upper bound with the lower bound (27) yields
the inequality

$$\frac{1}{3} \ge 1 - 2\, \frac{16\,T\,\delta^2 + \log 2}{d \log(2/\sqrt{e})}.$$

Recalling that $c = \frac{L}{2}$, making the substitution $\delta = \frac{18\epsilon}{c} = \frac{36\epsilon}{L}$ and performing some algebra yields

$$T \ge c_0\, \frac{L^2}{\epsilon^2} \left( \frac{d}{3} \log\Big(\frac{2}{\sqrt{e}}\Big) - \log 2 \right) \ge c_1\, \frac{L^2 d}{\epsilon^2} \quad \text{for all } d \ge 11 \text{ and for all } \epsilon \le \frac{L}{144},$$

where $c_0$ and $c_1$ are universal constants. Combined with Theorem 5.3.1 of NY [5] (or by using the
lower bound of Lemma 4 instead of Lemma 3), we conclude that this lower bound holds for all
dimensions $d$.
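The algebra in the last step can be made explicit. Solving the displayed inequality for $T$ gives $T \ge \big(\frac{d}{3}\log(2/\sqrt{e}) - \log 2\big)/(16\delta^2)$ with $\delta = 36\epsilon/L$. The Python sketch below (the function name is ours) evaluates this expression; it also shows why the argument needs $d \ge 11$, since the numerator is negative for $d \le 10$.

```python
import math

def min_queries_oracle_a(d, eps, L):
    """Lower bound on T implied by combining Lemma 2 with the bound (27).

    Uses delta = 36 * eps / L, which must not exceed 1/4 (i.e. eps <= L / 144).
    A non-positive return value means the bound is vacuous for this dimension.
    """
    delta = 36 * eps / L
    if not 0 < delta <= 0.25:
        raise ValueError("need 0 < eps <= L / 144")
    numerator = (d / 3) * math.log(2 / math.sqrt(math.e)) - math.log(2)
    return numerator / (16 * delta ** 2)
```

Halving the target error $\epsilon$ multiplies the returned value by four, which is the $1/\epsilon^2$ scaling in the statement of the theorem.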

Part (b): Proof for $p > 2$. The preceding proof based on Oracle A is also valid for $p > 2$, but
yields a relatively weak result. Here we show how the use of Oracle B yields the stronger claim
stated in Theorem 1(b). When using this oracle, all $d$ coin tosses at each round are revealed, so
that Lemma 3 with $\ell = d$ yields the lower bound

$$\mathbb{P}[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \ge 1 - 2\, \frac{16\,T\,d\,\delta^2 + \log 2}{d \log(2/\sqrt{e})}. \tag{29}$$

We now seek an upper bound on $\mathbb{P}[\hat{\alpha}(\mathcal{M}_T) \neq \alpha]$. As before, we use the set $\mathbb{S} = \mathbb{B}_\infty(\frac{1}{2})$,
and the previous definitions (28) of $f_i^+(x)$ and $f_i^-(x)$. From our earlier analysis (in particular,
equation (23)), the quantity $\|\hat{z}_{\alpha,B}(x)\|_p$ is at most $c\,d^{1/p - 1}$, so that setting $c = L d^{1 - 1/p}$ yields
functions that are Lipschitz with parameter $L$.

As before, for any distinct pair $\alpha, \beta \in \mathcal{V}$, we have the lower bound

$$\rho(g_\alpha, g_\beta) = \frac{2c\delta}{d}\, \Delta_H(\alpha, \beta) \ge \frac{c\delta}{2},$$

so that $\psi(\delta) \ge \frac{c\delta}{2}$. Consequently, if we set the target error $\epsilon := \frac{c\delta}{18}$, then we are guaranteed that
$\epsilon \le \frac{\psi(\delta)}{9}$, as is required for applying Lemma 2. Application of this lemma yields the upper bound
$\mathbb{P}_\phi[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \le \frac{1}{3}$. Combined with the lower bound (29), we obtain the inequality

$$\frac{1}{3} \ge 1 - 2\, \frac{16\,d\,T\,\delta^2 + \log 2}{d \log(2/\sqrt{e})}.$$
Substituting $\delta = 18\epsilon/c$ yields the scaling $\epsilon \ge c_0 \frac{c}{\sqrt{T}}$ for all $d \ge 11$, $\epsilon \le c/72$ and a universal constant
$c_0$. Recalling that $c = L d^{1 - 1/p}$, we obtain the bound (10). Combining this with Theorem 5.3.1
of NY [5] (or by using the lower bound of Lemma 4 instead of Lemma 3) gives the claim for all
dimensions.



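We have thus completed the proof of Theorem 1 in the special case $\mathbb{S} = \mathbb{B}_\infty(\frac{1}{2})$. In order to
prove the general claims, which scale with $r$ when $\mathbb{B}_\infty(r) \subseteq \mathbb{S}$, we note that our preceding proof
required only that $\mathbb{S} \supseteq \mathbb{B}_\infty(\frac{1}{2})$ so that the minimizing points $x_\alpha = -\alpha/2 \in \mathbb{S}$ for all $\alpha$ (in particular,
the Lipschitz constant of $g_\alpha$ does not depend on $\mathbb{S}$ for our construction). In the general case, we
define our base functions to be

$$f_i^+(x) = \Big| x(i) + \frac{r}{2} \Big|, \quad \text{and} \quad f_i^-(x) = \Big| x(i) - \frac{r}{2} \Big|.$$

With this choice, the functions $g_\alpha(x)$ are minimized at $x_\alpha = -r\alpha/2$, and $\inf_{x \in \mathbb{S}} g_\alpha(x) = cr/2 - cr\delta$.
Mimicking the previous steps, we obtain the lower bound

$$\rho(g_\alpha, g_\beta) \ge \frac{c r \delta}{2} \quad \forall\, \alpha \neq \beta \in \mathcal{V}.$$

The rest of the proof above did not depend on $\mathbb{S}$, so that we again obtain the lower bound $T \ge c_0 \frac{d}{\delta^2}$
or $T \ge c_0 \frac{1}{\delta^2}$ depending on the oracle used, for a universal constant $c_0$. In this case, the difference in
the $\rho$ computation means that $\epsilon = \frac{c r \delta}{18} \le \frac{c r}{72}$, from which the general claims follow.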



4.3 Proof of Theorem 2

We now turn to the proof of lower bounds on the oracle complexity of the class of strongly convex
functions defined earlier. In this case, we work with the following family of base functions,
parametrized by a scalar $\theta \in [0, 1)$:

$$f_i^+(x) = r\theta\,|x(i) + r| + \frac{(1 - \theta)}{4}\,(x(i) + r)^2, \quad \text{and} \quad f_i^-(x) = r\theta\,|x(i) - r| + \frac{(1 - \theta)}{4}\,(x(i) - r)^2. \tag{30}$$

A key ingredient of the proof is a uniform lower bound on the discrepancy p between pairs of these 
functions: 



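Lemma 5. Using an ensemble based on the base functions (30), we have

$$\rho(g_\alpha, g_\beta) \ge \begin{cases} \dfrac{2c\delta^2 r^2}{(1 - \theta)\,d}\, \Delta_H(\alpha, \beta) & \text{if } 1 - \theta \ge \dfrac{4\delta}{1 + 2\delta}, \\[2ex] \dfrac{c\delta r^2}{d}\, \Delta_H(\alpha, \beta) & \text{if } 1 - \theta < \dfrac{4\delta}{1 + 2\delta}. \end{cases}$$

The proof of this lemma is provided in Appendix A. Let us now proceed to the proofs of the main
theorem claims.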



Part (a): Proof for $p = 1$. We observe that both the functions $f_i^+, f_i^-$ are $r$-Lipschitz with
respect to the $\|\cdot\|_1$ norm by construction. Hence, $g_\alpha$ is $cr$-Lipschitz and furthermore, by the
definition of Oracle A, we have $\mathbb{E}\|\hat{z}_{\alpha,A}(x)\|_1^2 \le c^2 r^2$. In addition, the function $g_\alpha$ is
$(1 - \theta)c/(4d)$-strongly convex with respect to the Euclidean norm. We now follow the same steps as the proof of
Theorem 1, but this time exploiting the ensemble formed by the base functions (30), and the lower
bound on the discrepancy $\rho(g_\alpha, g_\beta)$ from Lemma 5. We split our analysis into two sub-cases.

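Case 1: First suppose that $1 - \theta \ge 4\delta/(1 + 2\delta)$, in which case Lemma 5 yields the lower bound

$$\rho(g_\alpha, g_\beta) \ge \frac{2c\delta^2 r^2}{(1 - \theta)\,d}\, \Delta_H(\alpha, \beta) \overset{(i)}{\ge} \frac{c\delta^2 r^2}{2(1 - \theta)} \quad \forall\, \alpha \neq \beta \in \mathcal{V},$$

where inequality (i) uses the fact that $\Delta_H(\alpha, \beta) \ge d/4$ by definition of $\mathcal{V}$. Hence by definition
of $\psi$, we have established the lower bound $\psi(\delta) \ge \frac{c\delta^2 r^2}{2(1 - \theta)}$. Setting the target error
$\epsilon := c\delta^2 r^2/(18(1 - \theta))$, we observe that this ensures $\epsilon \le \psi(\delta)/9$. Recalling the requirement $\delta \le 1/4$, we
note that $\epsilon \le cr^2/(288(1 - \theta))$. In this regime, we may apply Lemma 2 to obtain the upper bound
$\mathbb{P}_\phi[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \le \frac{1}{3}$. Combining this upper bound with the lower bound (27) yields the inequality

$$\frac{1}{3} \ge 1 - 2\, \frac{16\,T\,\delta^2 + \log 2}{d \log(2/\sqrt{e})} = 1 - 2\, \frac{\frac{288\,T\,\epsilon\,(1 - \theta)}{c r^2} + \log 2}{d \log(2/\sqrt{e})}.$$

Simplifying the above expression yields that for $d \ge 11$, we have the lower bound

$$T \ge c r^2 \left( \frac{\frac{d}{3} \log(2/\sqrt{e}) - \log 2}{288\,\epsilon\,(1 - \theta)} \right) \ge c r^2\, \frac{d \log(2/\sqrt{e})}{28800\,\epsilon\,(1 - \theta)}. \tag{32}$$

Finally, we observe that $L = cr$ and $\gamma^2 = (1 - \theta)c/(4d)$, which gives $1 - \theta = 4 d r \gamma^2/L$. Substituting
the above relations in the lower bound (32) gives the first term in the stated result for $d \ge 11$.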

To obtain lower bounds for dimensions $d < 11$, we use an argument based on $d = 1$. For this
special case, we consider $f^+$ and $f^-$ to be the two functions of the single coordinate coming out
of definition (30). The packing set $\mathcal{V}$ consists of only two elements now, corresponding to $\alpha = 1$
and $\alpha = -1$. Specializing the result of Lemma 5 to this case, we see that the two functions are
$2c\delta^2 r^2/(1 - \theta)$ separated. Now we again apply Lemma 2 to get an upper bound on the error
probability and Lemma 4 to get a lower bound, which gives the result for $d < 11$.

Case 2: On the other hand, suppose that $1 - \theta < 4\delta/(1 + 2\delta)$. In this case, appealing to Lemma 5
gives us that $\rho(g_\alpha, g_\beta) \ge c\delta r^2/4$ for $\alpha \neq \beta \in \mathcal{V}$. Recalling that $L = cr$, we set the desired accuracy
$\epsilon := c\delta r^2/36 = L\delta r/36$. From this point onwards, we mimic the proof of Theorem 1; doing so yields
that for all $\delta \in (0, 1/4)$, we have

$$T \ge c_0\, \frac{d}{\delta^2} = c_0\, \frac{L^2 d\, r^2}{1296\,\epsilon^2},$$

corresponding to the second term in Theorem 2, for a universal constant $c_0$.

Finally, the third and fourth terms are obtained just like Theorem 1 by checking the condition
$\delta < 1/4$ in the two cases above. Overall, this completes the proof for the case $p = 1$.

Part (b): Proof for $p > 2$. As with the proof of Theorem 1(b), we use Oracle B that returns
$d$-dimensional values and gradients in this case, with the base functions defined in equation (30).
With this choice, we have the upper bound

$$\mathbb{E}\|\hat{z}_{\alpha,B}(x)\|_p^2 \le c^2 d^{2/p - 2} r^2,$$

so that setting the constant $c = L d^{1 - 1/p}/r$ ensures that $\mathbb{E}\|\hat{z}_{\alpha,B}(x)\|_p^2 \le L^2$. As before, we have the
strong convexity parameter

$$\gamma^2 = \frac{c(1 - \theta)}{4d} = \frac{L d^{-1/p} (1 - \theta)}{4r}.$$

Also $\rho(g_\alpha, g_\beta)$ is given by Lemma 5. In particular, let us consider the case $1 - \theta \ge 4\delta/(1 + 2\delta)$,
so that $\psi(\delta) \ge \frac{c\delta^2 r^2}{2(1 - \theta)}$, and we set the desired accuracy $\epsilon := \frac{c\delta^2 r^2}{18(1 - \theta)}$ as before. With this setting
of $\epsilon$, we invoke Lemma 2 as before to argue that $\mathbb{P}_\phi[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \le \frac{1}{3}$. To lower bound the error
probability, we appeal to Lemma 3 with $\ell = d$ just like Theorem 1(b) and obtain the inequality

$$\frac{1}{3} \ge 1 - 2\, \frac{16\,d\,T\,\delta^2 + \log 2}{d \log(2/\sqrt{e})}.$$

Rearranging terms and substituting $\epsilon = \frac{c\delta^2 r^2}{18(1 - \theta)}$, we obtain for $d \ge 11$

$$T \ge c_0\, \frac{1}{\delta^2} = c_0\, \frac{c r^2}{18\,\epsilon\,(1 - \theta)},$$

for a universal constant $c_0$. The stated result can now be attained by recalling $c = L d^{1 - 1/p}/r$ and
$\gamma^2 = L d^{-1/p}(1 - \theta)/(4r)$ for $1 - \theta \ge 4\delta/(1 + 2\delta)$ and $d \ge 11$. For $d < 11$, the cases of $p > 2$ and
$p = 1$ are identical up to constant factors in the lower bounds we state. This completes the proof
for $1 - \theta \ge 4\delta/(1 + 2\delta)$.

Finally, the case $1 - \theta < 4\delta/(1 + 2\delta)$ involves similar modifications as part (a) by using the
different expression for $\rho(g_\alpha, g_\beta)$. Thus we have completed the proof of this theorem.



4.4 Proof of Theorem 3

We begin by constructing an appropriate subset of $\mathcal{F}_{sp}(k)$ over which the Fano method can be
applied. Let $\mathcal{V}(k) := \{\alpha^1, \ldots, \alpha^M\}$ be a set of vectors, such that each $\alpha^j \in \{-1, 0, +1\}^d$ satisfies

$$\|\alpha^j\|_0 = k \ \text{ for all } j = 1, \ldots, M, \quad \text{and} \quad \Delta_H(\alpha^j, \alpha^\ell) \ge \frac{k}{2} \ \text{ for all } j \neq \ell.$$

It can be shown that there exists such a packing set with $|\mathcal{V}(k)| \ge \exp\big(\frac{k}{2} \log \frac{d - k}{k/2}\big)$ elements (e.g.,
see Lemma 5 in Raskutti et al. [22]).

For any $\alpha \in \mathcal{V}(k)$, we define the function

$$g_\alpha(x) := c \left[ \sum_{i=1}^d \Big\{ \Big(\frac{1}{2} + \alpha_i\delta\Big) |x(i) + r| + \Big(\frac{1}{2} - \alpha_i\delta\Big) |x(i) - r| \Big\} + \delta \sum_{i=1}^d |x(i)| \right]. \tag{33}$$

In this definition, the quantity $c > 0$ is a pre-factor to be chosen later, and $\delta \in (0, \frac{1}{4}]$ is a given
error tolerance. Observe that each function $g_\alpha \in \mathcal{G}(\delta; k)$ is convex, and Lipschitz with parameter $c$
with respect to the $\|\cdot\|_\infty$ norm.

Central to the remainder of the proof is the function class $\mathcal{G}(\delta; k) := \{g_\alpha,\ \alpha \in \mathcal{V}(k)\}$. In
particular, we need to control the discrepancy $\psi(\delta; k) := \psi(\mathcal{G}(\delta; k))$ for this class. The following
result, proven in Appendix B, provides a suitable lower bound:
Lemma 6. We have

$$\psi(\delta; k) = \inf_{\alpha \neq \beta \in \mathcal{V}(k)} \rho(g_\alpha, g_\beta) \ge \frac{c k \delta r}{4}. \tag{34}$$

Using Lemma 6, we may complete the proof of Theorem 3. Define the base functions

$$f_i^+(x) := d\,\big(|x(i) + r| + \delta |x(i)|\big), \quad \text{and} \quad f_i^-(x) := d\,\big(|x(i) - r| + \delta |x(i)|\big).$$

Consider Oracle B, which returns $d$-dimensional gradients based on the function

$$\hat{g}_{\alpha,B}(x) = \frac{c}{d} \sum_{i=1}^d \big[ b_i f_i^+(x) + (1 - b_i) f_i^-(x) \big],$$

where $\{b_i\}$ are Bernoulli variables. By construction, the function $\hat{g}_{\alpha,B}$ is at most $3c$-Lipschitz in
$\ell_\infty$ norm (i.e. $\|\hat{z}_{\alpha,B}(x)\|_\infty \le 3c$), so that setting $c = \frac{L}{3}$ yields an $L$-Lipschitz function.

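Our next step is to use Fano's inequality [19] to lower bound the probability of error in the
multiway testing problem associated with this stochastic oracle, following an argument similar to
(but somewhat simpler than) the proof of Lemma 3. Fano's inequality yields the lower bound

$$\mathbb{P}[\hat{\alpha} \neq \alpha^*] \ge 1 - \frac{\frac{1}{\binom{|\mathcal{V}|}{2}} \sum_{\alpha \neq \beta} D(\mathbb{P}_\alpha \,\|\, \mathbb{P}_\beta) + \log 2}{\log |\mathcal{V}|}. \tag{35}$$

(As in the proof of Lemma 3, we have used convexity of mutual information [19] to bound it by the
average of the pairwise KL divergences.) By construction, any two parameters $\alpha, \beta \in \mathcal{V}$ differ in
at most $2k$ places, and the remaining entries are all zeroes in both vectors. The proof of Lemma 3
shows that for $\delta \in [0, \frac{1}{4}]$, each of these $2k$ places makes a contribution of at most $16\delta^2$. Recalling
that we have $T$ samples, we conclude that $D(\mathbb{P}_\alpha \,\|\, \mathbb{P}_\beta) \le 32\,k\,T\,\delta^2$. Substituting this upper bound
into the Fano lower bound (35) and recalling that the cardinality of $\mathcal{V}$ is at least $\exp\big(\frac{k}{2} \log \frac{d - k}{k/2}\big)$,
we obtain

$$\mathbb{P}[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \ge 1 - \frac{32\,k\,T\,\delta^2 + \log 2}{\frac{k}{2} \log \frac{d - k}{k/2}}. \tag{36}$$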

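By Lemma 6 and our choice $c = L/3$, we have

$$\psi(\delta) \ge \frac{c k \delta r}{4} = \frac{L k \delta r}{12}.$$

Therefore, if we aim for the target error $\epsilon = \frac{L k \delta r}{108}$, then we are guaranteed that $\epsilon \le \frac{\psi(\delta)}{9}$, as is
required for the application of Lemma 2. Recalling the requirement $\delta \le 1/4$ gives $\epsilon \le L k r/432$.
Now Lemma 2 implies that $\mathbb{P}[\hat{\alpha}(\mathcal{M}_T) \neq \alpha] \le 1/3$, which when combined with the earlier bound (36)
yields

$$\frac{1}{3} \ge 1 - \left( \frac{32\,k\,T\,\delta^2 + \log 2}{\frac{k}{2} \log \frac{d - k}{k/2}} \right).$$

Rearranging yields the lower bound

$$T \ge c_0 \left( \frac{\log \frac{d - k}{k/2}}{\delta^2} \right) = c_0 \left( \frac{L^2 r^2 k^2 \log \frac{d - k}{k/2}}{108^2\, \epsilon^2} \right),$$

for a universal constant $c_0$, where the second step uses the relation $\delta = \frac{108\,\epsilon}{L k r}$ for $k, d \ge 11$. As long
as $k \le \lfloor d/2 \rfloor$, we have $\log \frac{d - k}{k/2} = \Theta\big(\log \frac{d}{k}\big)$, which gives the result for $k, d \ge 11$. The result for
$k, d < 11$ follows from Theorem 1(b) applied with $p = \infty$, completing the proof.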



5 Discussion 

In this paper, we have studied the complexity of convex optimization within the stochastic first-order 
oracle model. We derived lower bounds for various function classes, including convex functions, 
strongly convex functions, and convex functions with sparse optima. As we discussed, our lower 
bounds are sharp in general, since there are matching upper bounds achieved by known algorithms, 
among them stochastic gradient descent and stochastic mirror descent. Our bounds also reveal 
various dimension-dependent and geometric aspects of the stochastic oracle complexity of convex 
optimization. An interesting aspect of our proof technique is the use of tools common in statistical 
minimax theory. In particular, our proofs are based on constructing packing sets, defined with 
respect to a pre-metric that measures the degree of separation between the optima of different
functions. We then leveraged information-theoretic techniques, in particular Fano's inequality and 
its variants, in order to establish lower bounds. 

There are various directions for future research. It would be interesting to consider the effect 
of memory constraints on the complexity of convex optimization, or to derive lower bounds for 
problems of distributed optimization. We suspect that the proof techniques developed in this 
paper may be useful for studying these related problems. 
A Proof of Lemma 5

Let $g_\alpha$ and $g_\beta$ be an arbitrary pair of functions in our class, and recall that the constraint set $\mathbb{S}$ is
given by the ball $\mathbb{B}_\infty(r)$. From the definition (18) of the discrepancy $\rho$, we need to compute the
single function infimum $\inf_{x \in \mathbb{B}_\infty(r)} g_\alpha(x)$, as well as the quantity $\inf_{x \in \mathbb{B}_\infty(r)} \{g_\alpha(x) + g_\beta(x)\}$.
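Evaluating the single function infimum: Beginning with the former quantity, first observe
that for any $x \in \mathbb{B}_\infty(r)$, we have

$$|x(i) + r| = x(i) + r \quad \text{and} \quad |x(i) - r| = r - x(i). \tag{37}$$

Consequently, using the definition (30) of the base functions, some algebra yields the relations

$$f_i^+(x) = \frac{1 - \theta}{4}\, x(i)^2 + \frac{1 + 3\theta}{4}\, r^2 + \frac{1 + \theta}{2}\, r\, x(i), \quad \text{and} \quad f_i^-(x) = \frac{1 - \theta}{4}\, x(i)^2 + \frac{1 + 3\theta}{4}\, r^2 - \frac{1 + \theta}{2}\, r\, x(i).$$

Using these expressions for $f_i^+$ and $f_i^-$, we obtain

$$\begin{aligned} \Big(\frac{1}{2} + \alpha_i\delta\Big) f_i^+(x) + \Big(\frac{1}{2} - \alpha_i\delta\Big) f_i^-(x) &= \frac{1}{2}\big(f_i^+(x) + f_i^-(x)\big) + \alpha_i\delta\,\big(f_i^+(x) - f_i^-(x)\big) \\ &= \frac{1 - \theta}{4}\, x(i)^2 + \frac{1 + 3\theta}{4}\, r^2 + (1 + \theta)\,\alpha_i\,\delta\, r\, x(i) \;=:\; h_i(x). \end{aligned}$$

A little calculation shows that the constrained minimum of the univariate function $h_i$ over the interval
$[-r, r]$ is achieved at

$$x^*(i) := \begin{cases} -\dfrac{2\alpha_i\delta r (1 + \theta)}{1 - \theta} & \text{if } \dfrac{1 - \theta}{1 + \theta} \ge 2\delta, \\[2ex] -\alpha_i r & \text{if } \dfrac{1 - \theta}{1 + \theta} < 2\delta, \end{cases}$$

where we have recalled that $\alpha_i$ takes values in $\{-1, +1\}$. Substituting the minimizing argument
$x^*(i)$, we find that the minimum value is given by

$$h_i(x^*(i)) = \begin{cases} \dfrac{1 + 3\theta}{4}\, r^2 - \dfrac{\delta^2 r^2 (1 + \theta)^2}{1 - \theta} & \text{if } \dfrac{1 - \theta}{1 + \theta} \ge 2\delta, \\[2ex] \dfrac{1 + \theta}{2}\, r^2 - (1 + \theta)\,\delta\, r^2 & \text{if } \dfrac{1 - \theta}{1 + \theta} < 2\delta. \end{cases}$$

Summing over all co-ordinates $i \in \{1, 2, \ldots, d\}$, we obtain

$$\inf_{x \in \mathbb{B}_\infty(r)} g_\alpha(x) = \frac{c}{d} \sum_{i=1}^d h_i(x^*(i)) = \begin{cases} -\dfrac{\delta^2 r^2 c\,(1 + \theta)^2}{1 - \theta} + \dfrac{c r^2 (1 + 3\theta)}{4} & \text{if } \dfrac{1 - \theta}{1 + \theta} \ge 2\delta, \\[2ex] \dfrac{1 + \theta}{2}\, c r^2 - (1 + \theta)\, c\,\delta\, r^2 & \text{if } \dfrac{1 - \theta}{1 + \theta} < 2\delta. \end{cases} \tag{38}$$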
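The closed-form minimizer of $h_i$ can be checked against a brute-force grid search. The following Python sketch is an illustrative check only; the function names and the grid resolution are our choices.

```python
def h(x, alpha_i, delta, r, theta):
    """Univariate function h_i from the proof: a convex quadratic in x."""
    return ((1 - theta) / 4) * x ** 2 + ((1 + 3 * theta) / 4) * r ** 2 \
        + (1 + theta) * alpha_i * delta * r * x

def closed_form_minimizer(alpha_i, delta, r, theta):
    """Minimizer of h_i over [-r, r]: interior stationary point or boundary."""
    if (1 - theta) / (1 + theta) >= 2 * delta:
        return -2 * alpha_i * delta * r * (1 + theta) / (1 - theta)
    return -alpha_i * r

def grid_minimum(alpha_i, delta, r, theta, n=20001):
    """Brute-force minimum of h_i over an evenly spaced grid on [-r, r]."""
    return min(
        h(-r + 2 * r * k / (n - 1), alpha_i, delta, r, theta) for k in range(n)
    )
```

With $r = 1$ and $\delta = 0.1$, the choice $\theta = 0.2$ falls in the interior regime (minimum value $0.382$ at $x^*(i) = -0.3\,\alpha_i$), while $\theta = 0.9$ falls in the boundary regime (minimum value $0.76$ at $x^*(i) = -\alpha_i$).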



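Evaluating the joint infimum: Here we begin by observing that for any two $\alpha, \beta \in \mathcal{V}$, we have

$$g_\alpha(x) + g_\beta(x) = \frac{c}{d} \sum_{i=1}^d \Big[ \frac{1 - \theta}{2}\, x(i)^2 + \frac{1 + 3\theta}{2}\, r^2 + 2(1 + \theta)\,\alpha_i\,\delta\, r\, x(i)\, \mathbb{I}(\alpha_i = \beta_i) \Big]. \tag{39}$$

As in our previous calculation, the only coordinates that contribute to $\rho(g_\alpha, g_\beta)$ are the ones where
$\alpha_i \neq \beta_i$, and for such coordinates, the function above is minimized at $x^*(i) = 0$. Furthermore, the
minimum value for any such coordinate is $(1 + 3\theta)\,c r^2/(2d)$.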
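We split the remainder of our analysis into two cases: first, if we suppose that $\frac{1 - \theta}{1 + \theta} \ge 2\delta$, or
equivalently that $1 - \theta \ge 4\delta/(1 + 2\delta)$, then equation (39) yields that

$$\inf_{x \in \mathbb{B}_\infty(r)} \{g_\alpha(x) + g_\beta(x)\} = \frac{c}{d} \sum_{i=1}^d \Big[ \frac{1 + 3\theta}{2}\, r^2 - \frac{2\delta^2 r^2 (1 + \theta)^2}{1 - \theta}\, \mathbb{I}(\alpha_i = \beta_i) \Big].$$

Combined with our earlier expression (38) for the single function infimum, we obtain that the
discrepancy is given by

$$\rho(g_\alpha, g_\beta) = \frac{2\delta^2 r^2 c\,(1 + \theta)^2}{d(1 - \theta)}\, \Delta_H(\alpha, \beta) \ge \frac{2\delta^2 r^2 c}{d(1 - \theta)}\, \Delta_H(\alpha, \beta).$$

On the other hand, if we assume that $\frac{1 - \theta}{1 + \theta} < 2\delta$, or equivalently that $1 - \theta < 4\delta/(1 + 2\delta)$, then
we obtain

$$\inf_{x \in \mathbb{B}_\infty(r)} \{g_\alpha(x) + g_\beta(x)\} = \frac{c}{d} \sum_{i=1}^d \Big[ \frac{1 + 3\theta}{2}\, r^2 - \Big( 2(1 + \theta)\, r^2 \delta - \frac{1 - \theta}{2}\, r^2 \Big)\, \mathbb{I}(\alpha_i = \beta_i) \Big].$$

Combined with our earlier expression (38) for the single function infimum, we obtain

$$\rho(g_\alpha, g_\beta) = \frac{c}{d} \Big( 2(1 + \theta)\, r^2 \delta - \frac{1 - \theta}{2}\, r^2 \Big) \Delta_H(\alpha, \beta) \overset{(i)}{\ge} \frac{c\,(1 + \theta)\,\delta\, r^2}{d}\, \Delta_H(\alpha, \beta),$$

where step (i) uses the bound $1 - \theta < 2\delta(1 + \theta)$. Noting that $\theta \ge 0$ completes the proof of the
lemma.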



B Proof of Lemma 6

Recall that the constraint set $\mathbb{S}$ in this lemma is the ball $\mathbb{B}_\infty(r)$. Thus, recalling the definition (18)
of the discrepancy $\rho$, we need to compute the single function infimum $\inf_{x \in \mathbb{B}_\infty(r)} g_\alpha(x)$, as well as
the quantity $\inf_{x \in \mathbb{B}_\infty(r)} \{g_\alpha(x) + g_\beta(x)\}$.



Evaluating the single function infimum: Beginning with the former quantity, first observe
that for any $x \in \mathbb{B}_\infty(r)$, we have

$$\Big(\frac{1}{2} + \alpha_i\delta\Big) |x(i) + r| + \Big(\frac{1}{2} - \alpha_i\delta\Big) |x(i) - r| = r + 2\alpha_i\delta\, x(i). \tag{40}$$

We now consider one of the individual terms arising in the definition (33) of the function $g_\alpha$. Using
the relation (40), it can be written as

$$\Big(\frac{1}{2} + \alpha_i\delta\Big) |x(i) + r| + \Big(\frac{1}{2} - \alpha_i\delta\Big) |x(i) - r| + \delta |x(i)| = \begin{cases} r + (2\alpha_i + 1)\,\delta\, x(i) & \text{if } x(i) \ge 0, \\ r + (2\alpha_i - 1)\,\delta\, x(i) & \text{if } x(i) \le 0. \end{cases}$$

From this representation, we see that whenever $\alpha_i \neq 0$, then the $i$th term in the summation
defining $g_\alpha$ is minimized at $x(i) = -r\alpha_i$, at which point it takes on its minimum value $r(1 - \delta)$. On
the other hand, for any term with $\alpha_i = 0$, the function is minimized at $x(i) = 0$ with associated
minimum value of $r$. Combining these two facts shows that the vector $-\alpha r$ is an element of the set
$\arg\min_{x \in \mathbb{S}} g_\alpha(x)$, and moreover that

$$\inf_{x \in \mathbb{S}} g_\alpha(x) = c r\,(d - k\delta). \tag{41}$$
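Equation (41) can be confirmed numerically for a small sparse vector. The sketch below (names are ours) uses the fact that each coordinate term is piecewise linear on $[-r, r]$ with its only interior kink at $0$, so the coordinate-wise minimum over the ball is attained at one of the points $-r$, $0$, $r$.

```python
def sparse_coord_term(x, a, delta, r):
    """Coordinate term of (33): weighted |x + r|, |x - r| plus delta * |x|."""
    return (0.5 + a * delta) * abs(x + r) + (0.5 - a * delta) * abs(x - r) + delta * abs(x)

def inf_g_alpha(alpha, delta, r, c):
    """Infimum of g_alpha over the ball B_inf(r), computed coordinate-wise."""
    candidates = (-r, 0.0, r)  # kinks and endpoints of the piecewise-linear term
    return c * sum(
        min(sparse_coord_term(x, a, delta, r) for x in candidates) for a in alpha
    )

def closed_form_inf(alpha, delta, r, c):
    """Right-hand side of (41): c * r * (d - k * delta), k = number of non-zeros."""
    d = len(alpha)
    k = sum(1 for a in alpha if a != 0)
    return c * r * (d - k * delta)
```

For $\alpha = (1, 0, -1, 0, 0)$ with $\delta = 0.2$, $r = 2$ and $c = 3$, both computations give $27.6$.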



Evaluating the joint infimum: We now turn to the computation of $\inf_{x \in \mathbb{B}_\infty(r)} \{g_\alpha(x) + g_\beta(x)\}$.
From the relation (40) and the definitions of $g_\alpha$ and $g_\beta$, some algebra yields

$$\inf_{x \in \mathbb{S}} \{g_\alpha(x) + g_\beta(x)\} = c \inf_{x \in \mathbb{S}} \sum_{i=1}^d \big\{ 2r + 2\delta\, [(\alpha_i + \beta_i)\, x(i) + |x(i)|] \big\}. \tag{42}$$

Let us consider the minimizer of the $i$th term in this summation. First, suppose that $\alpha_i \neq \beta_i$,
in which case there are two possibilities.

• If $\alpha_i \neq \beta_i$ and neither $\alpha_i$ nor $\beta_i$ is zero, then we must have $\alpha_i + \beta_i = 0$, so that the minimum
value of $2r$ is achieved at $x(i) = 0$.

• Otherwise, suppose that $\alpha_i \neq 0$ and $\beta_i = 0$. In this case, we see from Equation (42) that it
is equivalent to minimizing $\alpha_i x(i) + |x(i)|$. Setting $x(i) = -\alpha_i$ achieves the minimum value
of $2r$.

In the remaining two cases, we have $\alpha_i = \beta_i$.

• If $\alpha_i = \beta_i \neq 0$, then the component is minimized at $x(i) = -\alpha_i r$ and the minimum value
along the component is $2r(1 - \delta)$.

• If $\alpha_i = \beta_i = 0$, then the minimum value is $2r$, achieved at $x(i) = 0$.

Consequently, accumulating all of these individual cases into a single expression, we obtain

$$\inf_{x \in \mathbb{S}} \{g_\alpha(x) + g_\beta(x)\} = 2cr \Big( d - \delta \sum_{i=1}^d \mathbb{I}[\alpha_i = \beta_i \neq 0] \Big). \tag{43}$$

Finally, combining equations (41) and (43) in the definition of $\rho$, we find that

$$\rho(g_\alpha, g_\beta) = 2cr \Big[ d - \delta \sum_{i=1}^d \mathbb{I}[\alpha_i = \beta_i \neq 0] - (d - k\delta) \Big] = 2c\delta r \Big[ k - \sum_{i=1}^d \mathbb{I}[\alpha_i = \beta_i \neq 0] \Big] = c r \delta\, \Delta_H(\alpha, \beta),$$

where the final equality follows since $\alpha$ and $\beta$ have exactly $k$ non-zero elements each. Finally,
since $\mathcal{V}$ is a $k/2$-packing set in Hamming distance, we have $\Delta_H(\alpha, \beta) \ge k/2$, which completes the
proof.
C Upper bounds via mirror descent



This appendix is devoted to background on the family of mirror descent methods. We first describe 
the basic form of the algorithm and some known convergence results, before showing that different 
forms of mirror descent provide matching upper bounds for several of the lower bounds established 
in this paper, as discussed in the main text. 



C.1 Background on mirror descent

Mirror descent is a generalization of (projected) stochastic gradient descent, first introduced by
Nemirovski and Yudin [5]; here we follow a more recent presentation of it due to Beck and Teboulle [23].
For a given norm $\|\cdot\|$, let $\Phi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a differentiable function that is 1-strongly convex
with respect to $\|\cdot\|$, meaning that

$$\Phi(y) \ge \Phi(x) + \langle \nabla\Phi(x),\, y - x \rangle + \frac{1}{2} \|y - x\|^2.$$

We assume that $\Phi$ is a function of Legendre type [24, 25], which implies that the conjugate dual
$\Phi^*$ is differentiable on its domain with $\nabla\Phi^* = (\nabla\Phi)^{-1}$. For a given proximal function, we let $D_\Phi$
be the Bregman divergence induced by $\Phi$, given by

$$D_\Phi(x, y) := \Phi(x) - \Phi(y) - \langle \nabla\Phi(y),\, x - y \rangle. \tag{44}$$

With this set-up, we can now describe the mirror descent algorithm based on the proximal function
$\Phi$ for minimizing a convex function $f$ over a convex set $\mathbb{S}$ contained within the domain of $\Phi$. Starting
with an arbitrary initial $x_0 \in \mathbb{S}$, it generates a sequence $\{x_t\}_{t=0}^\infty$ contained within $\mathbb{S}$ via the updates

$$x_{t+1} = \arg\min_{x \in \mathbb{S}} \big\{ \eta_t \langle x, \nabla f(x_t) \rangle + D_\Phi(x, x_t) \big\}, \tag{45}$$

where $\eta_t > 0$ is a stepsize. In the case of stochastic optimization, $\nabla f(x_t)$ is simply replaced by the
noisy version $\hat{z}(x_t)$.

A special case of this algorithm is obtained by choosing the proximal function $\Phi(x) = \frac{1}{2}\|x\|_2^2$, 
which is 1-strongly convex with respect to the Euclidean norm. The associated Bregman divergence 
$D_\Phi(x, y) = \frac{1}{2}\|x - y\|_2^2$ is simply half the squared Euclidean distance, so that the updates (45) correspond to 
a standard projected gradient descent method. If one receives only an unbiased estimate of the 
gradient $\nabla f(x_t)$, then this algorithm corresponds to a form of projected stochastic gradient descent. 
Moreover, other choices of the proximal function lead to different stochastic algorithms, as discussed 
below. 
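
As an illustration of the Euclidean special case, here is a minimal Python sketch of projected stochastic gradient descent with stepsizes $\eta_t = 1/\sqrt{t}$ that returns the averaged iterate. The constraint set (a Euclidean ball), the quadratic test objective, and the helper names `project_l2_ball`, `projected_sgd` and `make_noisy_quadratic_oracle` are all assumptions introduced for this example; they do not come from the paper.

```python
import math
import random


def project_l2_ball(x, radius=1.0):
    """Euclidean projection of x onto the ball {u : ||u||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def projected_sgd(noisy_grad, x0, num_steps, radius=1.0):
    """Update (45) with Phi(x) = 0.5 * ||x||_2^2; returns the averaged iterate."""
    x = project_l2_ball(x0, radius)
    avg = [0.0] * len(x)
    for t in range(1, num_steps + 1):
        # Running average of the points at which the oracle is queried.
        avg = [a + (v - a) / t for a, v in zip(avg, x)]
        g = noisy_grad(x)
        eta = 1.0 / math.sqrt(t)  # stepsize eta_t = 1 / sqrt(t)
        x = project_l2_ball([v - eta * gi for v, gi in zip(x, g)], radius)
    return avg


def make_noisy_quadratic_oracle(center, noise_std, seed=0):
    """Hypothetical oracle: unbiased gradient of f(x) = 0.5 * ||x - center||_2^2."""
    rng = random.Random(seed)

    def noisy_grad(x):
        return [v - c + rng.gauss(0.0, noise_std) for v, c in zip(x, center)]

    return noisy_grad


# Example run: the minimizer of f over the unit ball is `center` itself.
center = [0.3, -0.4]
oracle = make_noisy_quadratic_oracle(center, noise_std=0.1)
x_bar = projected_sgd(oracle, [0.0, 0.0], num_steps=2000)
```

Averaging the iterates, rather than returning the last one, mirrors the averaged form in which the convergence guarantee for mirror descent is stated.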

Explicit convergence rates for this algorithm can be obtained under appropriate convexity 
and Lipschitz assumptions for $f$. Following the set-up used in our lower bound analysis, 
we assume that $\mathbb{E}\|\hat{z}(x)\|_*^2 \leq L^2$ for all $x \in S$, where $\|v\|_* := \sup_{\|x\| \leq 1} \langle x, v \rangle$ is the dual norm 
defined by $\|\cdot\|$. Given stochastic mirror descent based on unbiased estimates of the gradient, it 
can be shown (see e.g., Chapter 5.1 of NY [5] or Beck and Teboulle [23]) that with the initialization 
$x_0 = \arg\min_{x \in S} \Phi(x)$ and stepsizes $\eta_t = 1/\sqrt{t}$, the optimization error of the sequence $\{x_t\}$ is 
bounded as 



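$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[f(x_t) - f(x^*)\big] = O\Big( L\sqrt{\frac{\Phi(x^*)}{T}} \Big). \tag{46}$$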
Note that this averaged convergence is a little different from the convergence of $x_T$ discussed in 
our lower bounds. In order to relate the two quantities, observe that by Jensen's inequality 

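$$\mathbb{E}\Big[ f\Big( \frac{1}{T}\sum_{t=1}^{T} x_t \Big) \Big] \leq \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[f(x_t)\big].$$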

Consequently, based on mirror descent for $T - 1$ rounds, we may set $x_T = \frac{1}{T-1}\sum_{t=1}^{T-1} x_t$ so as to 
obtain the same convergence bounds up to constant factors. In the following discussion, we assume 
this choice of $x_T$ for comparing the mirror descent upper bounds to our lower bounds. 

C.2 Matching upper bounds 

Now consider the form of mirror descent obtained by choosing the proximal function 

$$\Phi_a(x) := \frac{1}{2(a-1)}\|x\|_a^2 \quad \text{for } 1 < a \leq 2. \tag{47}$$

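Note that this proximal function is 1-strongly convex with respect to the $\ell_a$-norm for $1 < a \leq 2$, 
meaning that 

$$\frac{1}{2(a-1)}\|x\|_a^2 \geq \frac{1}{2(a-1)}\|y\|_a^2 + \Big\langle \nabla \frac{1}{2(a-1)}\|y\|_a^2,\; x - y \Big\rangle + \frac{1}{2}\|x - y\|_a^2.$$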
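
The proximal function (47) has closed-form gradient maps, which is what makes the corresponding mirror descent update easy to compute. The Python sketch below is an illustration under stated assumptions rather than code from the paper: the function names are hypothetical, the formulas follow from differentiating $\frac{1}{2}\|x\|_a^2$ and its conjugate $\frac{1}{2}\|\theta\|_b^2$ with $b = a/(a-1)$, and `mirror_step` ignores the constraint set $S$ (no Bregman projection is performed).

```python
import math


def lp_norm(x, p):
    """The l_p norm of a vector given as a list of floats."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def phi(x, a):
    """Proximal function (47): ||x||_a^2 / (2 (a - 1)), for 1 < a <= 2."""
    return lp_norm(x, a) ** 2 / (2.0 * (a - 1.0))


def grad_phi(x, a):
    """Gradient of phi: sign(x_i) |x_i|^(a-1) ||x||_a^(2-a) / (a - 1)."""
    norm = lp_norm(x, a)
    if norm == 0.0:
        return [0.0] * len(x)
    scale = norm ** (2.0 - a) / (a - 1.0)
    return [scale * math.copysign(abs(v) ** (a - 1.0), v) for v in x]


def grad_phi_conj(theta, a):
    """Inverse of grad_phi, i.e. the gradient of the conjugate dual of phi."""
    b = a / (a - 1.0)  # conjugate exponent, 1/a + 1/b = 1
    norm = lp_norm(theta, b)
    if norm == 0.0:
        return [0.0] * len(theta)
    scale = (a - 1.0) * norm ** (2.0 - b)
    return [scale * math.copysign(abs(v) ** (b - 1.0), v) for v in theta]


def mirror_step(x, g, eta, a):
    """One unconstrained mirror descent step (no projection onto S)."""
    theta = [t - eta * gi for t, gi in zip(grad_phi(x, a), g)]
    return grad_phi_conj(theta, a)
```

For $a = 2$ both maps reduce to the identity, so `mirror_step` becomes an ordinary gradient step, consistent with the Euclidean special case.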

Upper bounds for dual setting: Let us start from the case $1 \leq p \leq 2$. In this case we use 
stochastic gradient descent with $a = 2$, and the choice of $p$ ensures that $\mathbb{E}\|\hat{z}(x)\|_2^2 \leq \mathbb{E}\|\hat{z}(x)\|_p^2 \leq L^2$ (the 
second inequality is true by assumption of Theorem 1). Also a straightforward calculation shows 
that $\|x^*\|_2 \leq \|x^*\|_q \, d^{1/2 - 1/q}$, so that we get the upper bound: 

$$\mathbb{E}\big[f(x_T) - f(x^*)\big] = O\Big( L\,\frac{d^{1/2 - 1/q}}{\sqrt{T}} \Big),$$

which matches the lower bound from Equation (11) for this case. For $p > 2$, we use mirror descent 
with $a = q = p/(p - 1)$. In this case, $\mathbb{E}\|\hat{z}(x)\|_p^2 \leq L^2$ and $\|x^*\|_q \leq 1$ for the convex set $B_q(1)$ 
and the function class $\mathcal{F}_{cv}(B_q(1), L, p)$. Hence in this case, the upper bound from Equation (46) is 
$O(L/\sqrt{T})$ as long as $p = o(\log d)$, which again matches our lower bound from Equation (11). Finally, 
for $p = \Omega(\log d)$, we use mirror descent with $a = 2\log d/(2\log d - 1)$, which gives an upper bound 
of $O(L\sqrt{\log d / T})$ (since $1/(a - 1) = O(\log d)$ in this regime). 

Upper bounds for ball: For this case, we use mirror descent based on the proximal function 
$\Phi_a$ with $a = q$. Under the condition $\|x^*\|_\infty \leq 1$, a condition which holds in our lower bounds, we 
obtain 

$$\|x^*\|_q \leq \|x^*\|_\infty \, d^{1/q} \leq d^{1/q},$$

which implies that $\Phi_q(x^*) = O(d^{2/q})$. Under the conditions of Theorem 1, we have $\mathbb{E}\|\hat{z}(x_t)\|_p^2 \leq L^2$, 
where $p = q/(q - 1)$ defines the dual norm. Note that the condition $1 < q \leq 2$ implies that $p \geq 2$. 
Substituting this in the upper bound (46) yields 

$$\mathbb{E}\big[f(x_T) - f(x^*)\big] = O\Big( L\sqrt{d^{2/q}/T} \Big) = O\Big( L\, d^{1 - 1/p} \sqrt{\frac{1}{T}} \Big),$$

which matches the lower bound from Theorem 2(b) (we note that there is an additional log factor 
here, just as in the preceding discussion when $p = \Omega(\log d)$, which we ignore here). 

For $1 \leq p \leq 2$, we use stochastic gradient descent with $q = 2$, in which case $\|x^*\|_2 \leq \sqrt{d}$ and 
$\mathbb{E}\|\hat{z}(x_t)\|_2^2 \leq \mathbb{E}\|\hat{z}(x_t)\|_p^2 \leq L^2$ by assumption. Substituting these in the upper bound for mirror 
descent yields an upper bound to match the lower bound of Theorem 2(a). 

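Upper bounds for Theorem 3: In order to recover matching upper bounds in this case, we use 
the function $\Phi_a$ from Equation (47) with $a = \frac{2\log d}{2\log d - 1}$. In this case, the resulting upper bound (46) 
on the convergence rate takes the form 

$$\mathbb{E}\big[f(x_T) - f(x^*)\big] = O\Big( L\sqrt{\frac{\|x^*\|_a^2}{2(a-1)\,T}} \Big) = O\Big( L\sqrt{\frac{\|x^*\|_a^2 \log d}{T}} \Big), \tag{48}$$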

since $\frac{1}{a-1} = 2\log d - 1$. Based on the conditions of Theorem 3, we are guaranteed that $x^*$ is $k$-sparse, 
with every component bounded by 1 in absolute value, so that $\|x^*\|_a^2 \leq k^{2/a} \leq k^2$, where the final 
inequality follows since $a > 1$. Substituting this upper bound back into Equation (48) yields 

$$\mathbb{E}\big[f(x_T) - f(x^*)\big] = O\Big( L\sqrt{\frac{k^2 \log d}{T}} \Big).$$

Note that whenever $k = O(d^{1-\delta})$ for some $\delta > 0$, then we have $\log d = \Theta(\log(d/k))$, in which case this 
upper bound matches the lower bound from Theorem 3 up to constant factors, as claimed. 
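
The two elementary facts used in this last argument are easy to check numerically. The Python sketch below is a sanity check under assumed values, not part of the paper's analysis: the function names are hypothetical, and the worst-case $k$-sparse vector is taken to have $k$ entries equal to 1.

```python
import math


def near_l1_exponent(d):
    """Exponent a = 2 log d / (2 log d - 1) used for the sparse setting."""
    if d < 3:
        # For d >= 3 we have log d >= 1, which guarantees 1 < a <= 2.
        raise ValueError("d must be at least 3")
    log_d = math.log(d)
    return 2.0 * log_d / (2.0 * log_d - 1.0)


def sparse_norm_squared(k, a):
    """||x||_a^2 for a vector with k entries equal to 1 and the rest 0."""
    return float(k) ** (2.0 / a)


d, k = 1000, 5  # illustrative dimension and sparsity (assumed values)
a = near_l1_exponent(d)
inverse_gap = 1.0 / (a - 1.0)        # equals 2 log d - 1
norm_sq = sparse_norm_squared(k, a)  # equals k^(2/a), at most k^2
```

The quantity `inverse_gap` is the factor that turns $\|x^*\|_a^2/(a-1)$ into a $\log d$ term in (48), while `norm_sq` stays below $k^2$ because $a > 1$.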